Yi Yang is a professor with the Faculty of Engineering and Information Technology, University of Technology Sydney (UTS). He received the PhD degree in Computer Science from Zhejiang University in 2010. He was a postdoc researcher at the School of Computer Science, Carnegie Mellon University before he came to Australia. See more information about our lab at http://reler.net/.
Can supervise: YES
Dong, X, Yan, Y, Tan, M, Yang, Y & Tsang, IW 2019, 'Late Fusion via Subspace Search With Consistency Preservation', IEEE Transactions on Image Processing, vol. 28, no. 1, pp. 518-528.
Yan, Y, Tan, M, Tsang, I, Yang, Y, Shi, Q & Zhang, C 2019, 'Fast and Low Memory Cost Matrix Factorization: Algorithm, Analysis and Case Study', IEEE Transactions on Knowledge and Data Engineering.
Matrix factorization has been widely applied to various applications. With the fast development of storage and internet technologies, we have been witnessing a rapid increase of data. In this paper, we propose new algorithms for matrix factorization with the emphasis on efficiency. In addition, most existing methods of matrix factorization only consider a general smooth least square loss. However, many real-world applications have distinctive characteristics, so different losses should be used accordingly. Therefore, it is beneficial to design new matrix factorization algorithms that are able to deal with both smooth and non-smooth losses. To this end, one needs to analyze the characteristics of target data and use the most appropriate loss based on the analysis. We particularly study two representative cases of low-rank matrix recovery, i.e., collaborative filtering for recommendation and high dynamic range imaging. To solve these two problems, we propose a stage-wise matrix factorization algorithm for each by exploiting manifold optimization techniques. Our theoretical analysis shows that both algorithms are provably guaranteed to converge to a stationary point. Extensive experiments on recommender systems and high dynamic range imaging demonstrate the satisfactory performance and efficiency of our proposed method on large-scale real data.
Dong, X, Zheng, L, Ma, F, Yang, Y & Meng, D 2019, 'Few-Example Object Detection with Model Communication', IEEE Transactions on Pattern Analysis and Machine Intelligence.
In this paper, we study object detection using a large pool of unlabeled images and only a few labeled images per category, named "few-example object detection". The key challenge consists in generating as many trustworthy training samples as possible from the pool. Using few training examples as seeds, our method iterates between model training and high-confidence sample selection. In training, easy samples are generated first, and then the poorly initialized model undergoes improvement. As the model becomes more discriminative, challenging but reliable samples are selected. After that, another round of model improvement takes place. To further improve the precision and recall of the generated training samples, we embed multiple detection models in our framework, which has proven to outperform the single model baseline and the model ensemble method. Experiments on PASCAL VOC'07, MS COCO'14, and ILSVRC'13 indicate that by using as few as three or four samples selected for each category, our method produces very competitive results when compared to the state-of-the-art weakly-supervised approaches using a large number of image-level labels.
Du, X, Nie, F, Wang, W, Yang, Y & Zhou, X 2019, 'Exploiting Combination Effect for Unsupervised Feature Selection by l2,0 Norm', IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 1, pp. 201-214.
In learning applications, exploring the cluster structures of high-dimensional data is an important task. It requires projecting or visualizing the cluster structures in a low-dimensional space. The challenges are: 1) how to perform the projection or visualization with less information loss and 2) how to preserve the interpretability of the original data. Recent methods address these challenges simultaneously by unsupervised feature selection. They learn the cluster indicators based on the k-nearest-neighbor similarity graph, then select the features highly correlated with these indicators. Under this direction, many techniques, such as local discriminative analysis, nonnegative spectral analysis, and nonnegative matrix factorization, have been successfully introduced to make the selection more accurate. In this paper, we focus on enhancing unsupervised feature selection from another perspective, namely, making the selection exploit the combination effect of the features. Given the expected number of features, previous works operate on the whole feature set and then select those with high coefficients one by one as the output. Our proposed method, instead, initially operates on a group of features and then updates the selection when a better group appears. Compared to the previous methods, the proposed method exploits the combination effect of the features through the l2,0 norm. It improves the selection accuracy where the cluster structures are strongly related to a group of features. We conduct experiments on six open access data sets from different domains. The experimental results show that our proposed method is more accurate than recent methods that do not specifically consider the combination effect of the features.
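To make the l2,0 norm in the abstract above concrete: for a coefficient matrix with one row per feature, the l2,0 norm counts the rows whose l2 norm is nonzero, so bounding it by k selects k features as a group rather than one by one. The snippet below is a minimal sketch of that definition only; `l20_norm`, the matrix `W`, and the tolerance `tol` are illustrative assumptions, not the authors' code.

```python
import math


def l20_norm(W, tol=1e-12):
    """Count the rows of W whose Euclidean (l2) norm exceeds tol."""
    return sum(1 for row in W if math.sqrt(sum(v * v for v in row)) > tol)


def selected_features(W, tol=1e-12):
    """Indices of the features (rows) that the matrix W keeps."""
    return [i for i, row in enumerate(W)
            if math.sqrt(sum(v * v for v in row)) > tol]


# Hypothetical coefficient matrix: 4 features (rows) x 2 cluster indicators.
W = [
    [0.5, -1.2],
    [0.0, 0.0],
    [0.0, 0.3],
    [0.0, 0.0],
]

print(l20_norm(W), selected_features(W))
```

Because whole rows are switched on or off, a constraint on this count scores a candidate group of features together, which is the "combination effect" the abstract refers to.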
Du, X, Yin, H, Chen, L, Wang, Y, Yang, Y & Zhou, X 2019, 'Personalized Video Recommendation Using Rich Contents from Videos', IEEE Transactions on Knowledge and Data Engineering.
Video recommendation has become an essential way of helping people explore the massive number of videos and discover the ones that may be of interest to them. In existing video recommender systems, the models make recommendations based on the user-video interactions and single specific content features. When the specific content features are unavailable, the performance of the existing models seriously deteriorates. Inspired by the fact that rich contents (e.g., text, audio, motion, and so on) exist in videos, in this paper, we explore how to use these rich contents to overcome the limitations caused by the unavailability of the specific ones. Specifically, we propose a novel general framework that incorporates an arbitrary single content feature with user-video interactions, named the collaborative embedding regression (CER) model, to make effective video recommendation in both in-matrix and out-of-matrix scenarios. Our extensive experiments on two real-world large-scale datasets show that CER beats the existing recommender models with any single content feature and is more time efficient. In addition, we propose a priority-based late fusion (PRI) method to gain the benefit of integrating multiple content features. The corresponding experiment shows that PRI brings real performance improvement to the baseline and outperforms the existing fusion methods.
© 2018 Elsevier B.V. Person re-identification (re-ID) is challenging because pedestrians may exhibit distinct appearances under different cameras. Given a query image, previous methods usually output the person retrieval results directly, which may perform badly due to the limited information provided by the single query image. To mine more query information, we add an expansion step to post-process the initial ranking list. The intuition is that a true match in the gallery may be difficult to find with the query alone, but it can be easily retrieved by other true matches in the initial ranking list. In this paper, we propose the Bayesian Query Expansion (BQE) method to generate a new query with information from the initial ranking list. The Bayesian model is used to predict true matches in the gallery. We apply pooling on the features of these 'true matches' to get a single vector, i.e., the expanded new query, with which the retrieval process is performed again to obtain the final results. We evaluate BQE with various feature extraction methods and distance metric learning methods on four large-scale re-ID datasets. We observe consistent improvement over all the baselines and report competitive performance compared with the state-of-the-art results.
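The expand-then-retrieve loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the Bayesian model that predicts true matches is replaced by a plain "take the top-k of the initial ranking" stand-in, the pooling is assumed to be mean pooling, and all names (`rank`, `mean_pool`, `expand_and_rerank`, the toy `gallery`) are hypothetical rather than taken from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank(query, gallery):
    """Gallery indices sorted from most to least similar to the query."""
    return sorted(range(len(gallery)), key=lambda i: sq_dist(query, gallery[i]))


def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def expand_and_rerank(query, gallery, top_k=3):
    initial = rank(query, gallery)
    # Stand-in for the Bayesian model: assume the top_k results are true matches.
    predicted_matches = [gallery[i] for i in initial[:top_k]]
    expanded_query = mean_pool(predicted_matches)
    # Retrieval is performed again with the expanded query.
    return rank(expanded_query, gallery)


# Toy 2-D features; real re-ID features would be much higher dimensional.
gallery = [[1.0, 0.0], [0.8, 0.2], [0.4, 0.6], [0.0, 1.0]]
query = [0.9, 0.0]

print(rank(query, gallery))
print(expand_and_rerank(query, gallery, top_k=3))
```

The second ranking differs from the first because the pooled vector sits closer to the bulk of the presumed matches than the single query does, which is the intuition the abstract gives for the expansion step.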
Liu, R, Zhao, Y, Wei, S, Zheng, L & Yang, Y 2019, 'Modality-invariant image-text embedding for image-sentence matching', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 15, no. 1.
© 2019 Association for Computing Machinery. Performing direct matching among different modalities (like image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to propose a uniform model for matching images with all types of texts, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneous gap between visual and textual information, it can provide only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which can provide heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull their distributions close by employing adversarial learning. However, the effectiveness of adversarial learning on image-sentence matching has not been proved and there is still no effective method. Inspired by previous works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by involving adversarial learning. On top of the triplet loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding into either the image or text modality. In addition, the multi-stage training procedure is carefully designed so that the proposed network not only imposes the image-text similarity constraints by ground-truth labels, but also enforces the image and text embedding distributions to be similar by adversarial learning. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvement over the baseline model and that our ...
Liu, W, Gong, D, Tan, M, Shi, Q, Yang, Y & Hauptmann, AG 2019, 'Learning Distilled Graph for Large-scale Social Network Data Clustering', IEEE Transactions on Knowledge and Data Engineering.
Spectral analysis is critical in social network analysis. As a vital step of the spectral analysis, the graph construction in many existing works utilizes content data only. Unfortunately, the content data often consists of noisy, sparse, and redundant features, which makes the resulting graph unstable and unreliable. In practice, besides the content data, social networks also contain link information, which provides additional information for graph construction. Some previous works utilize the link data. However, the link data is often incomplete, which makes the resulting graph incomplete. To address these issues, we propose a Distilled Graph Clustering (DGC) method. It pursues a distilled graph based on both the content data and the link data. The proposed algorithm alternates between two steps: in the feature selection step, it finds the most representative feature subset w.r.t. an intermediate graph initialized with link data; in the graph distillation step, it updates and refines the graph based on only the selected features. The final resulting graph, which is referred to as the distilled graph, is then utilized for spectral clustering on the large-scale social network data. Extensive experiments demonstrate the superiority of the proposed method.
Wu, Y, Lin, Y, Dong, X, Yan, Y, Bian, W & Yang, Y 2019, 'Progressive Learning for Person Re-Identification With One Example', IEEE Transactions on Image Processing, vol. 28, no. 6, pp. 2872-2881.
Zhan, K, Chang, X, Guan, J, Chen, L, Ma, Z & Yang, Y 2019, 'Adaptive Structure Discovery for Multimedia Analysis Using Multiple Features', IEEE Transactions on Cybernetics.
Multifeature learning has been a fundamental research problem in multimedia analysis. Most existing multifeature learning methods take a graph, which must be computed beforehand, as input to uncover the data distribution. These methods face two major problems. First, graph construction requires calculating similarity between nearby data pairs by a fixed function, e.g., the RBF kernel, but the intrinsic correlation among different data pairs varies constantly. Therefore, feature learning based on such predefined graphs may degrade, especially when there is dramatic correlation variation between nearby data pairs. Second, in most existing algorithms, the single-feature graphs are computed independently and then combined for learning, which ignores the correlation between multiple features. In this paper, a new unsupervised multifeature learning method is proposed to make the best use of the correlation among different features by jointly optimizing data correlation from multiple features in an adaptive way. As opposed to computing the affinity weight of data pairs by a fixed function, the weights of the affinity graph are learned by solving a well-designed optimization problem. Additionally, the affinity graph of data pairs from different features is optimized at a global level to better leverage the correlation among different channels. In this way, the adaptive approach correlates all the features for a better learning process. Experimental results on real-world datasets demonstrate that our approach outperforms the state-of-the-art algorithms on leveraging multiple features for multimedia analysis.
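For reference, the "fixed function" baseline that the abstract above argues against can be sketched as follows: every pair of points gets the weight exp(-||x_i - x_j||^2 / (2 sigma^2)), with one hand-chosen bandwidth for the whole graph. The function name `rbf_affinity`, the bandwidth values, and the toy `points` are assumptions for illustration; this is the predefined-graph approach, not the adaptive method the paper proposes.

```python
import math


def rbf_affinity(points, sigma=1.0):
    """Symmetric affinity matrix from a Gaussian (RBF) kernel with fixed sigma."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            A[i][j] = A[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return A


# Toy data: two nearby points and one distant point.
points = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]

narrow = rbf_affinity(points, sigma=0.5)
wide = rbf_affinity(points, sigma=5.0)
print(narrow[0][1], narrow[0][2])
print(wide[0][1], wide[0][2])
```

Running it with the two bandwidths shows how strongly the resulting graph depends on a single global parameter, which is the sensitivity that motivates learning the affinity weights instead.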
A graph is usually formed to reveal the relationship between data points, and the graph structure is encoded by the affinity matrix. Most graph-based multiview clustering methods use predefined affinity matrices, and the clustering performance highly depends on the quality of the graph. We learn a consensus graph by minimizing disagreement between different views and constraining the rank of the Laplacian matrix. Since diverse views admit the same underlying cluster structure, we use a new disagreement cost function for regularizing graphs from different views toward a common consensus. Simultaneously, we impose a rank constraint on the Laplacian matrix to learn the consensus graph with exactly k connected components, where k is the number of clusters, which is different from using fixed affinity matrices in most existing graph-based methods. With the learned consensus graph, we can directly obtain the cluster labels without performing any post-processing, such as the k-means clustering algorithm in spectral clustering-based methods. A multiview consensus clustering method is proposed to learn such a graph. An efficient iterative updating algorithm is derived to optimize the proposed challenging optimization problem. Experiments on several benchmark datasets have demonstrated the effectiveness of the proposed method in terms of seven metrics.
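The rank constraint in the abstract above rests on a standard fact from spectral graph theory: for a graph with nonnegative edge weights, the multiplicity of the zero eigenvalue of its Laplacian equals the number of connected components, so a Laplacian of rank n - k corresponds to exactly k components. Once a graph has that structure, cluster labels can be read off with a graph traversal and no k-means step. The sketch below illustrates only this last step on a hand-made affinity matrix; `component_labels` and the matrix `A` are illustrative assumptions, not the authors' implementation.

```python
from collections import deque


def component_labels(A, tol=0.0):
    """Label each node by the connected component it belongs to.

    A is a square affinity matrix; an edge exists where the weight exceeds tol.
    """
    n = len(A)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:  # breadth-first search over one component
            i = queue.popleft()
            for j in range(n):
                if labels[j] == -1 and (A[i][j] > tol or A[j][i] > tol):
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels


# Hypothetical learned graph with two blocks: nodes {0, 1, 2} and {3, 4}.
A = [
    [0.0, 0.9, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.4, 0.0, 0.0],
    [0.0, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.7, 0.0],
]

labels = component_labels(A)
print(labels, max(labels) + 1)
```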
Zheng, Z, Zheng, L & Yang, Y 2019, 'Pedestrian Alignment Network for Large-scale Person Re-identification', IEEE Transactions on Circuits and Systems for Video Technology.
Person re-identification (re-ID) is mostly viewed as an image retrieval problem. This task aims to search for a query person in a large image pool. In practice, person re-ID usually adopts automatic detectors to obtain cropped pedestrian images. However, this process suffers from two types of detector errors: excessive background and part missing. Both errors deteriorate the quality of pedestrian alignment and may compromise pedestrian matching due to the position and scale variances. To address the misalignment problem, we propose that alignment be learned from an identification procedure. We introduce the pedestrian alignment network (PAN), which allows discriminative embedding learning and pedestrian alignment without extra annotations. We observe that when the convolutional neural network (CNN) learns to discriminate between different identities, the learned feature maps usually exhibit strong activations on the human body rather than the background. The proposed network thus takes advantage of this attention mechanism to adaptively locate and align pedestrians within a bounding box. Visual examples show that pedestrians are better aligned with PAN. Experiments on three large-scale re-ID datasets confirm that PAN improves the discriminative ability of the feature embeddings and yields competitive accuracy with the state-of-the-art methods.
Zhong, Z, Zheng, L, Zheng, Z, Li, S & Yang, Y 2019, 'CamStyle: A Novel Data Augmentation Method for Person Re-Identification', IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1176-1190.
Person re-identification (re-ID) is a cross-camera retrieval task that suffers from image style variations caused by different cameras. Existing work implicitly addresses this problem by learning a camera-invariant descriptor subspace. In this paper, we explicitly consider this challenge by introducing camera style (CamStyle). CamStyle can serve as a data augmentation approach that reduces the risk of deep network overfitting and that smooths the camera style disparities. Specifically, with a style transfer model, labeled training images can be style transferred to each camera and, along with the original training samples, form the augmented training set. This method, while increasing data diversity against overfitting, also incurs a considerable level of noise. To alleviate the impact of noise, label smoothing regularization (LSR) is adopted. The vanilla version of our method (without LSR) performs reasonably well on few-camera systems, in which overfitting often occurs. With LSR, we demonstrate consistent improvement in all systems regardless of the extent of overfitting. We also report competitive accuracy compared with the state of the art on Market-1501 and DukeMTMC-reID. Importantly, CamStyle can be applied to the challenging problems of one-view learning and unsupervised domain adaptation (UDA) in person re-ID, both of which have critical research and application significance. The former only has labeled data in one camera view and the latter only has labeled data in the source domain. Experimental results show that CamStyle significantly improves the performance of the baseline in the two problems. In particular, for UDA, CamStyle achieves state-of-the-art accuracy based on a baseline deep re-ID model on Market-1501 and DukeMTMC-reID. Our code is available at: https://github.com/zhunzhong07/CamStyle .
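The label smoothing step mentioned above can be illustrated with the standard formulation: instead of a one-hot target, the true class receives 1 - eps + eps/K and every other class receives eps/K, where K is the number of classes. The sketch below is a generic version of that idea; the value `eps=0.1`, the function names, and the toy logits are assumptions, and the code does not reproduce how the paper applies LSR to real versus style-transferred images.

```python
import math


def smooth_labels(target, num_classes, eps=0.1):
    """Smoothed target distribution: 1 - eps + eps/K on the true class, eps/K elsewhere."""
    base = eps / num_classes
    return [base + (1.0 - eps if k == target else 0.0) for k in range(num_classes)]


def cross_entropy(logits, target_dist):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(q * (v - log_z) for q, v in zip(target_dist, logits))


# Toy example: 4 identities, the model is confident about identity 1.
logits = [0.0, 5.0, 0.0, 0.0]
hard = cross_entropy(logits, smooth_labels(1, 4, eps=0.0))
soft = cross_entropy(logits, smooth_labels(1, 4, eps=0.1))
print(hard, soft)
```

With smoothing, a very confident prediction is penalized slightly, which makes the loss less sensitive to a noisy label on a generated image.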
Du, X, Yin, H, Huang, Z, Yang, Y & Zhou, X 2018, 'Exploiting detected visual objects for frame-level video filtering', World Wide Web, vol. 21, no. 5, pp. 1259-1284.
© 2017 Springer Science+Business Media, LLC Videos are generated at an unprecedented speed on the web. To improve the efficiency of access, developing new ways to filter the videos becomes a popular research topic. One on-going direction is using visual objects to perform frame-level video filtering. Under this direction, existing works create the unique object table and the occurrence table to maintain the connections between videos and objects. However, the creation process is not scalable and dynamic because it heavily depends on human labeling. To improve this, we propose to use detected visual objects to create these two tables for frame-level video filtering. Our study begins with investigating the existing object detection techniques. After that, we find object detection lacks the identification and connection abilities to accomplish the creation process alone. To supply these abilities, we further investigate three candidates, namely, recognizing-based, matching-based and tracking-based methods, to work with the object detection. Through analyzing the mechanism and evaluating the accuracy, we find that they are imperfect for identifying or connecting the visual objects. Accordingly, we propose a novel hybrid method that combines the matching-based and tracking-based methods to overcome the limitations. Our experiments show that the proposed method achieves higher accuracy and efficiency than the candidate methods. The subsequent analysis shows that the proposed method can efficiently support the frame-level video filtering using visual objects.
Fan, H, Zheng, L, Yan, C & Yang, Y 2018, 'Unsupervised Person Re-identification: Clustering and Fine-tuning', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 14, no. 4.
Hu, Y, Zheng, L, Yang, Y & Huang, Y 2018, 'Twitter100k: A Real-World Dataset for Weakly Supervised Cross-Media Retrieval', IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 927-938.
© 2017 IEEE. This paper contributes a new large-scale dataset for weakly supervised cross-media retrieval, named Twitter100k. Current datasets, such as Wikipedia, NUS Wide, and Flickr30k, have two major limitations. First, these datasets are lacking in content diversity, i.e., only some predefined classes are covered. Second, texts in these datasets are written in well-organized language, leading to inconsistency with realistic applications. To overcome these drawbacks, the proposed Twitter100k dataset is characterized by two aspects: it has 100 000 image-text pairs randomly crawled from Twitter, and thus, has no constraint in the image categories; and text in Twitter100k is written in informal language by the users. Since strongly supervised methods leverage the class labels that may be missing in practice, this paper focuses on weakly supervised learning for cross-media retrieval, in which only text-image pairs are exploited during training. We extensively benchmark the performance of four subspace learning methods and three variants of the correspondence AutoEncoder, along with various text features on Wikipedia, Flickr30k, and Twitter100k. As a minor contribution, we also design a deep neural network to learn cross-modal embeddings for Twitter100k. Inspired by the characteristic of Twitter100k, we propose a method to integrate optical character recognition into cross-media retrieval. The experiment results show that the proposed method improves the baseline performance.
Li, Z, Nie, F, Chang, X, Nie, L, Zhang, H & Yang, Y 2018, 'Rank-Constrained Spectral Clustering With Flexible Embedding', IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 6073-6082.
Spectral clustering (SC) has been proven to be effective in various applications. However, the learning scheme of SC is suboptimal in that it learns the cluster indicator from a fixed graph structure, which usually requires a rounding procedure to further partition the data. Also, the obtained cluster number cannot reflect the ground truth number of connected components in the graph. To alleviate these drawbacks, we propose a rank-constrained SC with flexible embedding framework. Specifically, an adaptive probabilistic neighborhood learning process is employed to recover the block-diagonal affinity matrix of an ideal graph. Meanwhile, a flexible embedding scheme is learned to unravel the intrinsic cluster structure in low-dimensional subspace, where the irrelevant information and noise in high-dimensional data have been effectively suppressed. The proposed method is superior to previous SC methods in that: 1) the block-diagonal affinity matrix learned simultaneously with the adaptive graph construction process, more explicitly induces the cluster membership without further discretization; 2) the number of clusters is guaranteed to converge to the ground truth via a rank constraint on the Laplacian matrix; and 3) the mismatch between the embedded feature and the projected feature allows more freedom for finding the proper cluster structure in the low-dimensional subspace as well as learning the corresponding projection matrix. Experimental results on both synthetic and real-world data sets demonstrate the promising performance of the proposed algorithm.
Li, Z, Nie, F, Chang, X, Yang, Y, Zhang, C & Sebe, N 2018, 'Dynamic Affinity Graph Construction for Spectral Clustering Using Multiple Features', IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 6323-6332.
Spectral clustering (SC) has been widely applied to various computer vision tasks, where the key is to construct a robust affinity matrix for data partitioning. With the increase in visual features, conventional SC methods are facing two challenges: 1) how to effectively generate an affinity matrix based on multiple features and 2) how to deal with high-dimensional visual features, which could be redundant. To address these issues, we present a new approach to: 1) learn a robust affinity matrix using multiple features, allowing us to simultaneously determine optimal weights for each feature; and 2) learn a set of optimal projection matrices, one for each feature, that determine the lower dimensional space, as well as the optimal affinity weight of each data pair in the lower dimensional space. There are two major advantages of our new approach over the existing clustering techniques. First, our approach assigns affinity weights for data points on a per-data-pair basis. The learning procedure avoids the explicit specification of the size of the neighborhood in the affinity matrix, and the bandwidth parameter required to compute the Gaussian kernel, both of which are sensitive and yet difficult to determine beforehand. Second, the affinity weights are based on the distances in a lower dimensional space, while the low-dimensional space is inferred according to the optimized affinity weights. Both variables are jointly optimized so as to leverage mutual benefits. In the experiments, the proposed method outperforms the compared alternatives, which indicates that it is effective in simultaneously learning the affinity graph and feature fusion, resulting in better clustering results.
Liu, R, Wei, S, Zhao, Y & Yang, Y 2018, 'Indexing of the CNN features for the large scale image search', Multimedia Tools and Applications, vol. 77, no. 24, pp. 32107-32131.
© 2018, Springer Science+Business Media, LLC, part of Springer Nature. The convolutional neural network (CNN) features can give good description of image content, which usually represent an image with a single feature vector. Although CNN features are more compact than local descriptors, they still cannot efficiently deal with large-scale retrieval due to the linearly incremental cost of computation and storage. To address this issue, we build a simple but effective indexing framework on inverted table, which significantly decreases both search time and memory usage. First, several strategies are fully investigated to adapt inverted table to CNN features for compensating for quantization error. We use multiple assignment for the query and database images to increase the probability that relevant images are assigned to the same visual word obtained via clustering. Embedding codes are also introduced to improve retrieval accuracy by removing false matches. Second, a novel indexing framework that combines inverted table and hashing codes is proposed. This framework is faster than the reformed inverted tables with the introduced strategies. Experiment on several benchmark datasets demonstrates that our method yields faster retrieval speed compared to brute-force search. We also provide fair comparison between popular CNN features.
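The inverted-table idea with multiple assignment described above can be sketched as follows. Each database vector is filed under its nearest visual word; at query time the query is assigned to several nearest words, so only the images in those lists are compared instead of the whole database. This is a minimal sketch under assumptions: the centroids are given by hand rather than obtained by clustering, multiple assignment is shown on the query side only, and the embedding codes and hashing stage from the abstract are omitted. All names are hypothetical.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_words(vec, centroids, m):
    """Indices of the m visual words (centroids) closest to vec."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
    return order[:m]


def build_index(database, centroids, m=1):
    """Inverted table: visual word -> list of image ids assigned to it."""
    index = defaultdict(list)
    for img_id, vec in enumerate(database):
        for w in nearest_words(vec, centroids, m):
            index[w].append(img_id)
    return index


def search(query, database, centroids, index, m=2):
    """Rank only the images stored under the query's m nearest words."""
    candidates = set()
    for w in nearest_words(query, centroids, m):
        candidates.update(index.get(w, []))
    return sorted(candidates, key=lambda i: sq_dist(query, database[i]))


# Toy 2-D setup: three hand-placed visual words and four database images.
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
database = [[1.0, 0.0], [9.0, 1.0], [0.0, 9.0], [4.0, 0.0]]
index = build_index(database, centroids, m=1)
query = [6.0, 0.0]

print(search(query, database, centroids, index, m=1))
print(search(query, database, centroids, index, m=2))
```

In this toy case the closest image to the query was quantized to a different word than the query, so single assignment misses it while assigning the query to two words recovers it. That is the quantization error the multiple-assignment strategy is meant to compensate for.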
Liu, W, Chang, X, Yan, Y, Yang, Y & Hauptmann, AG 2018, 'Few-shot text and image classification via analogical transfer learning', ACM Transactions on Intelligent Systems and Technology, vol. 9, no. 6.
© 2018 ACM. Learning from very few samples is a challenge for machine learning tasks, such as text and image classification. Performance of such tasks can be enhanced via transfer of helpful knowledge from related domains, which is referred to as transfer learning. In previous transfer learning works, instance transfer learning algorithms mostly focus on selecting the source domain instances similar to the target domain instances for transfer. However, the selected instances usually do not directly contribute to the learning performance in the target domain. Hypothesis transfer learning algorithms focus on the model/parameter level transfer. They treat the source hypotheses as well-trained and transfer their knowledge in terms of parameters to learn the target hypothesis. Such algorithms directly optimize the target hypothesis by the observable performance improvements. However, they fail to consider the problem that instances that contribute to the source hypotheses may be harmful for the target hypothesis, as instance transfer learning has analyzed. To relieve the aforementioned problems, we propose a novel transfer learning algorithm, which follows an analogical strategy. In particular, the proposed algorithm first learns a revised source hypothesis with only instances contributing to the target hypothesis. Then, the proposed algorithm transfers both the revised source hypothesis and the target hypothesis (only trained with a few samples) to learn an analogical hypothesis. We denote our algorithm as Analogical Transfer Learning. Extensive experiments on one synthetic dataset and three real-world benchmark datasets demonstrate the superior performance of the proposed algorithm.
Luo, M, Chang, X, Nie, L, Yang, Y, Hauptmann, AG & Zheng, Q 2018, 'An Adaptive Semisupervised Feature Analysis for Video Semantic Recognition', IEEE Transactions on Cybernetics, vol. 48, no. 2, pp. 648-660.
© 2017 IEEE. Video semantic recognition usually suffers from the curse of dimensionality and the absence of enough high-quality labeled instances; thus, semisupervised feature selection is gaining increasing attention for its efficiency and comprehensibility. Most of the previous methods assume that videos with close distance (neighbors) have similar labels and characterize the intrinsic local structure through a predetermined graph of both labeled and unlabeled data. However, besides the parameter tuning problem underlying the construction of the graph, the affinity measurement in the original feature space usually suffers from the curse of dimensionality. Additionally, the predetermined graph is separate from the procedure of feature selection, which might lead to downgraded performance for video semantic recognition. In this paper, we explore a novel semisupervised feature selection method from a new perspective. The primary assumption underlying our model is that the instances with similar labels should have a larger probability of being neighbors. Instead of using a predetermined similarity graph, we incorporate the exploration of the local structure into the procedure of joint feature selection so as to learn the optimal graph simultaneously. Moreover, an adaptive loss function is exploited to measure the label fitness, which significantly enhances the model's robustness to videos with a small or substantial loss. We propose an efficient alternating optimization algorithm to solve the proposed challenging problem, together with theoretical analyses of its convergence and computational complexity. Finally, extensive experimental results on benchmark datasets illustrate the effectiveness and superiority of the proposed approach on video semantic recognition related tasks.
Luo, M, Nie, F, Chang, X, Yang, Y, Hauptmann, AG & Zheng, Q 2018, 'Adaptive Unsupervised Feature Selection with Structure Regularization', IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 4, pp. 944-956.
© 2012 IEEE. Feature selection is one of the most important dimension reduction techniques because of its efficiency and interpretability. Since practical large-scale data are usually collected without labels, and labeling these data is extremely expensive and time-consuming, unsupervised feature selection has become a ubiquitous and challenging problem. Without label information, the fundamental problem of unsupervised feature selection lies in how to characterize the geometric structure of the original feature space and produce a faithful feature subset that preserves the intrinsic structure accurately. In this paper, we characterize the intrinsic local structure by an adaptive reconstruction graph and simultaneously consider its multiconnected-components (multicluster) structure by imposing a rank constraint on the corresponding Laplacian matrix. To achieve a desirable feature subset, we learn the optimal reconstruction graph and selective matrix simultaneously, instead of using a predetermined graph. We exploit an efficient alternating optimization algorithm to solve the proposed challenging problem, together with theoretical analyses of its convergence and computational complexity. Finally, extensive clustering experiments are conducted on several benchmark data sets to verify the effectiveness and superiority of the proposed unsupervised feature selection algorithm.
Wang, H, Wu, F, Lu, W, Yang, Y, Li, X, Li, X & Zhuang, Y 2018, 'Identifying Objective and Subjective Words via Topic Modeling', IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 3, pp. 718-730.
Zeng, Z, Li, Z, Cheng, D, Zhang, H, Zhan, K & Yang, Y 2018, 'Two-Stream Multi-Rate Recurrent Neural Network for Video-Based Pedestrian Re-Identification', IEEE Transactions on Industrial Informatics.
Video-based pedestrian re-identification is a fundamental task in video surveillance and real-world applications, and has attracted much research attention recently. Its goal is to match pedestrians across multiple non-overlapping network cameras. In this paper, we propose a novel two-stream multi-rate recurrent neural network for video-based pedestrian re-identification, which has two inherent benefits: (1) it captures static spatial and temporal information; (2) it deals with motion speed variance. Given video sequences of pedestrians, we start by extracting spatial and motion features using two different deep neural networks. We then combine them using a regularized fusion network, which aims to explore feature correlations. Going a step further, we feed the two features into a multi-rate recurrent network to exploit the temporal correlations and, more importantly, to take into account that pedestrians, sometimes even the same pedestrian, move at different speeds across different camera views. Extensive experiments have been conducted on two real-world video-based pedestrian re-identification benchmarks: the iLIDS-VID and PRID 2011 datasets. The experimental results confirm the superiority of the proposed method. Our code will be released upon acceptance.
Zhang, S, Yang, Y, Xiao, J, Liu, X, Yang, Y, Xie, D & Zhuang, Y 2018, 'Fusing geometric features for skeleton-based action recognition using multilayer LSTM Networks', IEEE Transactions on Multimedia, vol. 20, no. 9, pp. 2330-2343.
© 1999-2012 IEEE. Recent skeleton-based action recognition approaches achieve great improvement by using recurrent neural network (RNN) models. Currently, these approaches build an end-to-end network from coordinates of joints to class categories and improve accuracy by extending RNN to spatial domains. First, while such well-designed models and optimization strategies explore relations between different parts directly from joint coordinates, we provide a simple universal spatial modeling method that is orthogonal to the RNN model enhancement. Specifically, following the evolution of previous work, we select a set of simple geometric features and then separately feed each type of feature to a three-layer LSTM framework. Second, we propose a multistream LSTM architecture with a new smoothed score fusion technique to learn classification from different geometric feature streams. Furthermore, we observe that the geometric relational features based on distances between joints and selected lines outperform other features, and the fusion results achieve state-of-the-art performance on four datasets. We also show the sparsity of input gate weights in the first LSTM layer trained by geometric features and demonstrate that utilizing joint-line distances as input requires less data for training.
Zheng, L, Yang, Y & Tian, Q 2018, 'SIFT Meets CNN: A Decade Survey of Instance Retrieval', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 5, pp. 1224-1244.
In the early days, content-based image retrieval (CBIR) was studied with global features. Since 2003, image retrieval based on local descriptors (de facto SIFT) has been extensively studied for over a decade due to the advantage of SIFT in dealing with image transformations. Recently, image representations based on the convolutional neural network (CNN) have attracted increasing interest in the community and demonstrated impressive performance. Given this time of rapid evolution, this article provides a comprehensive survey of instance retrieval over the last decade. Two broad categories, SIFT-based and CNN-based methods, are presented. For the former, we organize the literature according to codebook size: large, medium-sized, or small. For the latter, we discuss three lines of methods, i.e., using pre-trained CNN models, using fine-tuned CNN models, and hybrid methods. The first two perform a single pass of an image through the network, while the last category employs a patch-based feature extraction scheme. This survey presents milestones in modern instance retrieval, reviews a broad selection of previous works in different categories, and provides insights on the connection between SIFT and CNN-based methods. After analyzing and comparing retrieval performance of different categories on several datasets, we discuss promising directions towards generic and specialized instance retrieval.
Zheng, L, Yang, Y & Zheng, Z 2018, 'A Discriminatively Learned CNN Embedding for Person Re-identification', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 14, no. 1, pp. 1-20.
In this article, we revisit two popular convolutional neural networks in person re-identification (re-ID): verification and identification models. The two models have their respective advantages and limitations due to different loss functions. Here, we shed light on how to combine the two models to learn more discriminative pedestrian descriptors. Specifically, we propose a Siamese network that simultaneously computes the identification loss and verification loss. Given a pair of training images, the network predicts the identities of the two input images and whether they belong to the same identity. Our network learns a discriminative embedding and a similarity measurement at the same time, thus making full use of the re-ID annotations. Our method can be easily applied to different pretrained networks. Albeit simple, the learned embedding improves the state-of-the-art performance on two public person re-ID benchmarks. Further, we show that our architecture can also be applied to image retrieval. The code is available at https://github.com/layumi/2016_person_re-ID.
Chang, X & Yang, Y 2017, 'Semisupervised Feature Analysis by Mining Correlations Among Multiple Tasks', IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2294-2305.
In this paper, we propose a novel semisupervised feature selection framework by mining correlations among multiple tasks and apply it to different multimedia applications. Instead of independently computing the importance of features for each task, our algorithm leverages shared knowledge from multiple related tasks, thus improving the performance of feature selection. Note that the proposed algorithm is built upon the assumption that different tasks share some common structures. The proposed algorithm selects features in a batch mode, by which the correlations between various features are taken into consideration. Besides, considering the fact that labeling a large amount of training data in the real world is both time-consuming and tedious, we adopt manifold learning, which exploits both labeled and unlabeled training data for feature space analysis. Since the objective function is nonsmooth and difficult to solve, we propose an iterative algorithm with fast convergence. Extensive experiments on different applications demonstrate that our algorithm outperforms other state-of-the-art feature selection algorithms.
Chang, X, Ma, Z, Lin, M, Yang, Y & Hauptmann, AG 2017, 'Feature Interaction Augmented Sparse Learning for Fast Kinect Motion Detection', IEEE Transactions on Image Processing, vol. 26, no. 8, pp. 3911-3920.
© 2017 IEEE. Kinect sensing devices have been widely used in current Human-Computer Interaction entertainment. A fundamental issue involved is to detect users' motions accurately and quickly. In this paper, we tackle it by proposing a linear algorithm, which is augmented by feature interaction. The linear property guarantees its speed, whereas feature interaction captures the higher-order effect from the data to enhance its accuracy. The Schatten-p norm is leveraged to integrate the main linear effect and the higher-order nonlinear effect by mining the correlation between them. The resulting classification model is a desirable combination of speed and accuracy. We propose a novel solution to solve our objective function. Experiments are performed on three public Kinect-based entertainment data sets related to fitness and gaming. The results show that our method has an advantage for motion detection in a real-time Kinect entertainment environment.
Chang, X, Ma, Z, Yang, Y, Zeng, Z & Hauptmann, AG 2017, 'Bi-level semantic representation analysis for multimedia event detection', IEEE Transactions on Cybernetics, vol. 47, no. 5, pp. 1180-1197.
© 2013 IEEE. Multimedia event detection has been one of the major endeavors in video event analysis. A variety of approaches have been proposed recently to tackle this problem. Among others, using semantic representation has been credited for its promising performance and desirable ability for human-understandable reasoning. To generate semantic representation, we usually utilize several external image/video archives and apply the concept detectors trained on them to the event videos. Due to the intrinsic difference of these archives, the resulting representation is presumed to have different predicting capabilities for a certain event. Nevertheless, not much work is available for assessing the efficacy of semantic representation at the source level. On the other hand, it is plausible that some concepts are noisy for detecting a specific event. Motivated by these two shortcomings, we propose a bi-level semantic representation analyzing method. At the source level, our method learns weights of semantic representation attained from different multimedia archives. Meanwhile, it restrains the negative influence of noisy or irrelevant concepts at the overall concept level. In addition, we particularly focus on efficient multimedia event detection with few positive examples, which is highly valued in real-world scenarios. We perform extensive experiments on the challenging TRECVID MED 2013 and 2014 datasets with encouraging results that validate the efficacy of our proposed approach.
Chang, X, Yu, YL, Yang, Y & Xing, EP 2017, 'Semantic Pooling for Complex Event Analysis in Untrimmed Videos', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 8, pp. 1617-1632.
© 1979-2012 IEEE. Pooling plays an important role in generating a discriminative video representation. In this paper, we propose a new semantic pooling approach for challenging event analysis tasks (e.g., event detection, recognition, and recounting) in long untrimmed Internet videos, especially when only a few shots/segments are relevant to the event of interest while many other shots are irrelevant or even misleading. The commonly adopted pooling strategies aggregate the shots indiscriminately in one way or another, resulting in a great loss of information. Instead, in this work we first define a novel notion of semantic saliency that assesses the relevance of each shot to the event of interest. We then prioritize the shots according to their saliency scores, since shots that are semantically more salient are expected to contribute more to the final event analysis. Next, we propose a new isotonic regularizer that is able to exploit the constructed semantic ordering information. The resulting nearly-isotonic support vector machine classifier exhibits higher discriminative power in event analysis tasks. Computationally, we develop an efficient implementation using the proximal gradient algorithm, and we prove new and closed-form proximal steps. We conduct extensive experiments on three real-world video datasets and achieve promising improvements.
Li, Z, Nie, F, Chang, X & Yang, Y 2017, 'Beyond Trace Ratio: Weighted Harmonic Mean of Trace Ratios for Multiclass Discriminant Analysis', IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 10, pp. 2100-2110.
Luo, M, Nie, F, Chang, X, Yang, Y, Hauptmann, AG & Zheng, Q 2017, 'Avoiding Optimal Mean ℓ2,1-Norm Maximization-Based Robust PCA for Reconstruction', Neural Computation, vol. 29, no. 4, pp. 1124-1150.
Robust principal component analysis (PCA) is one of the most important dimension-reduction techniques for handling high-dimensional data with outliers. However, most existing robust PCA methods presuppose that the mean of the data is zero and incorrectly use the average of the data as the optimal mean of robust PCA. In fact, this assumption holds only for traditional PCA based on the squared ℓ2-norm. In this letter, we equivalently reformulate the objective of conventional PCA and learn the optimal projection directions by maximizing the sum of projected differences between each pair of instances based on the ℓ2,1-norm. The proposed method is robust to outliers and also invariant to rotation. More importantly, the reformulated objective not only automatically avoids the calculation of the optimal mean and makes the assumption of centered data unnecessary, but also theoretically connects to the minimization of reconstruction error. To solve the proposed nonsmooth problem, we exploit an efficient optimization algorithm to soften the contributions from outliers by reweighting each data point iteratively. We theoretically analyze the convergence and computational complexity of the proposed algorithm. Extensive experimental results on several benchmark data sets illustrate the effectiveness and superiority of the proposed method.
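As a rough illustration of why a pairwise formulation removes the need for a mean, the sketch below evaluates a sum of projected pairwise differences and checks two standard facts: with squared Euclidean norms the pairwise sum equals 2n times the mean-centered PCA objective, and the unsquared pairwise sum is unchanged when the data are shifted. The function names, the random data, and the random orthonormal projection are assumptions made for this example; it is not the authors' optimization algorithm, only an evaluation of the objective.

```python
import numpy as np

def pairwise_objective(X, W, squared=False):
    """Sum over all ordered pairs (i, j) of ||W^T (x_i - x_j)||_2, or of its square."""
    P = X @ W                               # projected data, shape (n, k)
    diff = P[:, None, :] - P[None, :, :]    # all pairwise differences, shape (n, n, k)
    norms = np.linalg.norm(diff, axis=2)    # Euclidean norm of each projected difference
    return float((norms ** 2).sum() if squared else norms.sum())

def centered_squared_objective(X, W):
    """Classical PCA objective: sum_i ||W^T (x_i - mean)||_2^2."""
    P = (X - X.mean(axis=0)) @ W
    return float((P ** 2).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6)) + 5.0             # deliberately uncentered toy data
W, _ = np.linalg.qr(rng.normal(size=(6, 2)))   # hypothetical projection with W^T W = I
n = X.shape[0]

# Squared case: the pairwise sum matches 2n times the mean-centered objective.
print(np.isclose(pairwise_objective(X, W, squared=True),
                 2 * n * centered_squared_objective(X, W)))

# Unsquared case: shifting every point leaves the pairwise objective unchanged,
# so no mean has to be estimated.
print(np.isclose(pairwise_objective(X + 100.0, W), pairwise_objective(X, W)))
```

The pairwise differences cancel any common offset, which is the property the abstract relies on when it says that centering becomes unnecessary.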
Ma, Z, Chang, X, Yang, Y, Sebe, N & Hauptmann, AG 2017, 'The Many Shades of Negativity', IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1558-1568.
© 1999-2012 IEEE. Complex event detection has been increasingly researched in recent years because of broad interest in video indexing and retrieval. For event detection, one needs to train a classifier using both positive and negative examples. Current classifier training treats the negative videos as equally negative. However, we notice that many negative videos resemble the positive videos to different degrees. Intuitively, we may capture more informative cues from the negative videos if we assign them fine-grained labels, thus benefiting the classifier learning. To this end, we use a statistical method on both the positive and negative examples to obtain the decisive attributes of a specific event. Based on these decisive attributes, we assign fine-grained labels to negative examples to treat them differently for more effective exploitation. The resulting fine-grained labels may not be optimal for capturing the discriminative cues from the negative videos. Hence, we propose to jointly optimize the fine-grained labels with the classifier learning, so that each benefits the other. Meanwhile, the labels of positive examples are supposed to remain unchanged. We thus additionally introduce a constraint for this purpose. In addition, state-of-the-art deep convolutional neural network features are leveraged in our approach for event detection to further boost the performance. Extensive experiments on the challenging TRECVID MED 2014 dataset have validated the efficacy of our proposed approach.
Nie, L, Wei, X, Zhang, D, Wang, X, Gao, Z & Yang, Y 2017, 'Data-Driven Answer Selection in Community QA Systems', IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 6, pp. 1186-1198.
Wu, F, Wang, Z, Lu, W, Li, X, Yang, Y, Luo, J & Zhuang, Y 2017, 'Regularized Deep Belief Network for Image Attribute Detection', IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 7, pp. 1464-1477.
© 1991-2012 IEEE. In general, an image attribute is a human-nameable visual property that has a semantic connotation. Appropriate modeling of the intrinsic contextual correlations among attributes plays a fundamental role in attribute detection. In this paper, we consider image attribute detection from the perspective of regularized deep learning. In particular, we propose a regularized deep belief network (rDBN) to perform the image attribute detection task, which is composed of two parts: 1) a detection DBN (dDBN) that models the joint distribution of images and their corresponding attributes and acts as an attribute detector, and 2) a contextual restricted Boltzmann machine that explicitly models the correlations among attributes and acts as a regularizer that restrains the detection result output by the dDBN to meet the contextual prior of attributes. Furthermore, we propose an efficient fine-tuning scheme that can further optimize the performance of the dDBN by backpropagation. Experimental results show that the proposed rDBN obtains improvements over the state-of-the-art methods for attribute detection on the benchmark data sets.
Zhu, L, Xu, Z, Yang, Y & Hauptmann, AG 2017, 'Uncovering the Temporal Context for Video Question Answering', International Journal of Computer Vision, vol. 124, no. 3, pp. 409-421.
© 2017, Springer Science+Business Media, LLC. In this work, we introduce Video Question Answering in the temporal domain to infer the past, describe the present and predict the future. We present an encoder–decoder approach using Recurrent Neural Networks to learn the temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore approaches for finer understanding of video content using the question form of 'fill-in-the-blank', and collect our Video Context QA dataset consisting of 109,895 video clips with a total duration of more than 1000 h from existing TACoS, MPII-MD and MEDTest 14 datasets. In addition, 390,744 corresponding questions are generated from annotations. Extensive experiments demonstrate that our approach significantly outperforms the compared baselines.
Zhuang, Y, Wang, H, Xiao, J, Wu, F, Yang, Y, Lu, W & Zhang, Z 2017, 'Bag-of-Discriminative-Words (BoDW) Representation via Topic Modeling', IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 5, pp. 977-990.
Chang, X, Nie, F, Wang, S, Yang, Y, Zhou, X & Zhang, C 2016, 'Compound Rank-k Projections for Bilinear Analysis', IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 7, pp. 1502-1513.
In many real-world applications, data are represented by matrices or high-order tensors. Despite the promising performance, the existing two-dimensional discriminant analysis algorithms employ a single projection model to exploit the discriminant information for projection, making the model less flexible. In this paper, we propose a novel Compound Rank-k Projection (CRP) algorithm for bilinear analysis. CRP deals with matrices directly without transforming them into vectors, and it therefore preserves the correlations within the matrix and decreases the computation complexity. Different from the existing two-dimensional discriminant analysis algorithms, objective function values of CRP increase monotonically. In addition, CRP utilizes multiple rank-k projection models to enable a larger search space in which the optimal solution can be found. In this way, the discriminant ability is enhanced.
Chang, X, Nie, F, Yang, Y, Zhang, C & Huang, H 2016, 'Convex Sparse PCA for Unsupervised Feature Learning', ACM Transactions on Knowledge Discovery from Data, vol. 11, no. 1, pp. 1-16.
Principal component analysis (PCA) has been widely applied to dimensionality reduction and data preprocessing for different applications in engineering, biology, social science, and the like. Classical PCA and its variants seek linear projections of the original variables to obtain low-dimensional feature representations with maximal variance. One limitation is that it is difficult to interpret the results of PCA. Besides, classical PCA is vulnerable to certain noisy data. In this paper, we propose a Convex Sparse Principal Component Analysis (CSPCA) algorithm and apply it to feature learning. First, we show that PCA can be formulated as a low-rank regression optimization problem. Based on this discussion, ℓ2,1-norm minimization is incorporated into the objective function to make the regression coefficients sparse and thereby robust to outliers. Also, based on the sparse model used in CSPCA, an optimal weight is assigned to each of the original features, which in turn provides the output with good interpretability. With the output of our CSPCA, we can effectively analyze the importance of each feature under the PCA criteria. Our new objective function is convex, and we propose an iterative algorithm to optimize it. We apply the CSPCA algorithm to feature selection and conduct extensive experiments on seven benchmark datasets. Experimental results demons
Chen, L, Li, X, Yang, Y, Kurniawati, H, Sheng, QZ, Hu, HY & Huang, N 2016, 'Personal health indexing based on medical examinations: A data mining approach', Decision Support Systems, vol. 81, pp. 54-65.
We design a method called MyPHI that predicts personal health index (PHI), a new evidence-based health indicator to explore the underlying patterns of a large collection of geriatric medical examination (GME) records using data mining techniques. We define PHI as a vector of scores, each reflecting the health risk in a particular disease category. The PHI prediction is formulated as an optimization problem that finds the optimal soft labels as health scores based on medical records that are infrequent, incomplete, and sparse. Our method is compared with classification models commonly used in medical applications. The experimental evaluation has demonstrated the effectiveness of our method based on a real-world GME data set collected from 102,258 participants.
Gan, C, Yang, Y, Zhu, L, Zhao, D & Zhuang, Y 2016, 'Recognizing an Action Using Its Name: A Knowledge-Based Approach', International Journal of Computer Vision, vol. 120, no. 1, pp. 61-77.
Han, Y, Yang, Y & Zhou, X 2016, 'Guest editorial: web multimedia semantic inference using multi-cues', World Wide Web-Internet and Web Information Systems, vol. 19, no. 2, pp. 177-179.
Wu, F, Fang, H, Li, X, Tang, S, Lu, W, Yang, Y, Zhu, W & Zhuang, Y 2016, 'Aspect Learning for Multimedia Summarization via Nonparametric Bayesian', IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 10, pp. 1931-1942.
Xia, Y, Nie, L, Zhang, L, Yang, Y, Hong, R & Li, X 2016, 'Weakly Supervised Multilabel Clustering and its Applications in Computer Vision', IEEE Transactions on Cybernetics, vol. 46, no. 12, pp. 3220-3232.
Yan, Y, Nie, F, Li, W, Gao, C, Yang, Y & Xu, D 2016, 'Image Classification by Cross-Media Active Learning With Privileged Information', IEEE Transactions on Multimedia, vol. 18, no. 12, pp. 2494-2502.
Yu, SI, Xu, S, Ma, Z, Li, H, Hauptmann, AG, Chang, X, Yang, Y, Meng, D, Lin, M, Lan, Z, Gan, C, Xu, Z, Mao, Z, Li, X, Jiang, L & Du, X 2016, 'Strategies for searching video content with text queries or video examples', ITE Transactions on Media Technology and Applications, vol. 4, no. 3, pp. 227-238.
© 2016 by ITE Transactions on Media Technology and Applications (MTA). The large number of user-generated videos uploaded to the Internet every day has led to many commercial video search engines, which mainly rely on text metadata for search. However, metadata is often lacking for user-generated videos, so these videos are unsearchable by current search engines. Content-based video retrieval (CBVR) tackles this metadata-scarcity problem by directly analyzing the visual and audio streams of each video. CBVR encompasses multiple research topics, including low-level feature design, feature fusion, semantic detector training and video search/reranking. We present novel strategies in these topics to enhance CBVR in both accuracy and speed under different query inputs, including pure textual queries and query by video examples. Our proposed strategies have been incorporated into our submission for the TRECVID 2014 Multimedia Event Detection evaluation, where our system outperformed other submissions in both text queries and video example queries, thus demonstrating the effectiveness of our proposed approaches.
Zhang, L, Li, X, Nie, L, Yang, Y & Xia, Y 2016, 'Weakly Supervised Human Fixations Prediction', IEEE Transactions on Cybernetics, vol. 46, no. 1, pp. 258-269.
Automatically predicting human eye fixations is a useful technique that can facilitate many multimedia applications, e.g., image retrieval, action recognition, and photo retargeting. Conventional approaches suffer from two drawbacks. First, psychophysical experiments show that an object-level interpretation of scenes influences eye movements significantly. Most of the existing saliency models rely on object detectors, and therefore, only a few prespecified categories can be discovered. Second, the relative displacement of objects influences their saliency remarkably, but current models cannot describe them explicitly. To solve these problems, this paper proposes weakly supervised fixations prediction, which leverages image labels to improve the accuracy of human fixations prediction. The proposed model hierarchically discovers objects as well as their spatial configurations. Starting from the raw image pixels, we sample superpixels in an image, whereby seamless object descriptors termed object-level graphlets (oGLs) are generated by random walking on the superpixel mosaic. Then, a manifold embedding algorithm is proposed to encode image labels into oGLs, and the response map of each prespecified object is computed accordingly. On the basis of the object-level response map, we propose spatial-level graphlets (sGLs) to model the relative positions among objects. Afterward, eye tracking data is employed to integrate these sGLs for predicting human eye fixations. Thorough experimental results demonstrate the advantage of the proposed method over the state-of-the-art.
Zhang, L, Yang, Y, Nie, F & Shao, L 2016, 'Perception, Aesthetics, and Emotion in Multimedia Quality Modeling Introduction', IEEE MultiMedia, vol. 23, no. 3, pp. 20-22.
Han, Y, Yang, Y & Wang, J 2015, 'Guest Editorial: Ad Hoc Web Multimedia Analysis with Limited Supervision', Multimedia Tools and Applications, vol. 74, no. 2, pp. 463-465.
Han, Y, Yang, Y, Wu, F & Hong, R 2015, 'Compact and Discriminative Descriptor Inference Using Multi-Cues', IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5114-5126.
Han, Y, Yang, Y, Yan, Y, Ma, Z, Sebe, N & Zhou, X 2015, 'Semisupervised Feature Selection via Spline Regression for Video Semantic Recognition', IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 2, pp. 252-264.
Yan, Y, Yang, Y, Meng, D, Liu, G, Tong, W, Hauptmann, AG & Sebe, N 2015, 'Event Oriented Dictionary Learning for Complex Event Detection', IEEE Transactions on Image Processing, vol. 24, no. 6, pp. 1867-1878.
Yang, Y, Ma, Z, Nie, F, Chang, X & Hauptmann, AG 2015, 'Multi-Class Active Learning by Uncertainty Sampling with Diversity Maximization', International Journal of Computer Vision, vol. 113, no. 2.
As a way to relieve the tedious work of manual annotation, active learning plays an important role in many applications of visual concept recognition. In typical active learning scenarios, the amount of labelled data in the seed set is usually small. However, most existing active learning algorithms only exploit the labelled data and thus often suffer from over-fitting due to the small number of labelled examples. Besides, while much progress has been made in binary class active learning, little research attention has been focused on multi-class active learning. In this paper, we propose a semi-supervised batch mode multi-class active learning algorithm for visual concept recognition. Our algorithm exploits the whole active pool to evaluate the uncertainty of the data. Considering that uncertain data are always similar to each other, we propose to make the selected data as diverse as possible, for which we explicitly impose a diversity constraint on the objective function. As a multi-class active learning algorithm, our algorithm is able to exploit uncertainty across multiple classes. An efficient algorithm is used to optimize the objective function. Extensive experiments on action recognition, object classification, scene recognition, and event detection demonstrate its advantages.
Yang, Y, Ma, Z, Yang, Y, Nie, F & Shen, HT 2015, 'Multitask Spectral Clustering by Exploring Intertask Correlation', IEEE Transactions on Cybernetics, vol. 45, no. 5, pp. 1069-1080.
Han, Y, Wei, X, Cao, X, Yang, Y & Zhou, X 2014, 'Augmenting image descriptions using structured prediction output', IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1665-1676.
© 2014 IEEE. The need for richer descriptions of images arises in a wide spectrum of applications ranging from image understanding to image retrieval. While the Automatic Image Annotation (AIA) has been extensively studied, image descriptions with the output labels lack sufficient information. This paper proposes to augment image descriptions using structured prediction output. We define a hierarchical tree-structured semantic unit to describe images, from which we can obtain not only the class and subclass one image belongs to, but also the attributes one image has. After defining a new feature map function of structured SVM, we decompose the loss function into every node of the hierarchical tree-structured semantic unit and then predict the tree-structured semantic unit for testing images. In the experiments, we evaluate the performance of the proposed method on two open benchmark datasets and compare with the state-of-the-art methods. Experimental results show the better prediction performance of the proposed method and demonstrate the strength of augmenting image descriptions.
Visual attributes can be considered as a middle-level semantic cue that bridges the gap between low-level image features and high-level object classes. Thus, attributes have the advantage of transcending specific semantic categories or describing objects across categories. Since attributes are often human-nameable and domain specific, much work constructs attribute annotations ad hoc or takes them from an application-dependent ontology. To facilitate other applications with attributes, it is necessary to develop methods which can adapt a well-defined set of attributes to novel images. In this paper, we propose a framework for image attribute adaptation. The goal is to automatically adapt the knowledge of attributes from a well-defined auxiliary image set to a target image set, thus assisting in predicting appropriate attributes for target images. In the proposed framework, we use a non-linear mapping function corresponding to multiple base kernels to map each training image of both the auxiliary and the target sets to a Reproducing Kernel Hilbert Space (RKHS), where we reduce the mismatch of data distributions between auxiliary and target images. In order to make use of unlabeled images, we incorporate a semi-supervised learning process. We also introduce a robust loss function into our framework to remove the shared irrelevance and noise of training images. Experiments on two pairs of auxiliary-target image sets demonstrate that the proposed framework predicts attributes for target testing images better than three baselines and two state-of-the-art domain adaptation methods. © 2014 IEEE.
With the advance of the Web 2.0 era came an explosive growth of geographical multimedia data shared on social network websites such as Flickr, YouTube, Facebook, and Zooomr. Location-aware media description, modeling, learning, and recommendation in pervasive social media analytics have become a key focus of the recent research in computer vision, multimedia, and signal processing societies. A new breed of multimedia applications that incorporates image/video annotation, visual search, content mining and recommendation, and so on may revolutionize the field. Combined with the popularity of location-aware social multimedia, location context data makes traditionally challenging problems more tractable. This special issue brings together active researchers to share recent progress in this exciting area. This issue highlights the latest developments in large-scale multiple evidence-based learning for geosocial multimedia computing and identifies several key challenges and potential innovations. © 2014 IEEE.
Li, P, Bu, J, Yang, Y, Ji, R, Chen, C & Cai, D 2014, 'Discriminative Orthogonal Nonnegative matrix factorization with flexibility for data representation', Expert Systems with Applications, vol. 41, no. 4, Part 1, pp. 1283-1293.
Learning an informative data representation is of vital importance in multidisciplinary applications, e.g., face analysis, document clustering and collaborative filtering. As a very useful tool, Nonnegative matrix factorization (NMF) is often employed to learn a well-structured data representation. While the geometrical structure of the data has been studied in some previous NMF variants, the existing works typically neglect the discriminant information revealed by the between-class scatter and the total scatter of the data. To address this issue, we present a novel approach named Discriminative Orthogonal Nonnegative matrix factorization (DON), which preserves both the local manifold structure and the global discriminant information simultaneously through manifold discriminant learning. In particular, to learn the discriminant structure for the data representation, we introduce the scaled indicator matrix, which naturally satisfies the orthogonality condition. Thus, we impose the orthogonality constraints on the objective function. However, too heavy constraints will lead to a very sparse data representation that is unexpected in reality. So we further make this orthogonality flexible. In addition, we provide the optimization framework with the convergence proof of the updating rules. Extensive comparisons over several state-of-the-art approaches demonstrate the efficacy of the proposed method. © 2013 Elsevier Ltd. All rights reserved.
Li, Z, Liu, J, Yang, Y, Zhou, X & Lu, H 2014, 'Clustering-Guided Sparse Structural Learning for Unsupervised Feature Selection', IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, vol. 26, no. 9, pp. 2138-2150.
Liu, J, Yang, Y, Huang, Z, Yang, Y & Shen, HT 2014, 'On the influence propagation of web videos', IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 8, pp. 1961-1973.
We propose a novel approach to analyze how a popular video is propagated in the cyberspace, to identify if it originated from a certain sharing-site, and to identify how it reached the current popularity in its propagation. In addition, we also estimate its influence across different websites outside the major hosting website. Web video is gaining significance due to its rich and eyeball-grabbing content. This phenomenon is evidently amplified and accelerated by the advance of Web 2.0. When a video receives some degree of popularity, it tends to appear on various websites including not only video-sharing websites but also news websites, social networks or even Wikipedia. Numerous video-sharing websites have hosted videos that reached a phenomenal level of visibility and popularity in the entire cyberspace. As a result, it is becoming more difficult to determine how the propagation took place: was the video a piece of original work that was intentionally uploaded to its major hosting site by the authors, did the video originate from some small site and then reach the sharing site after already getting a good level of popularity, or did it originate from other places in the cyberspace but the sharing site made it popular? Existing studies of this flow of influence are lacking. Literature that discusses the problem of estimating a video's influence in the whole cyberspace also remains rare. In this article we introduce a novel framework to identify the propagation of popular videos from its major hosting site's perspective, and to estimate its influence. We define a Unified Virtual Community Space (UVCS) to model the propagation and influence of a video, and devise a novel learning method called Noise-reductive Local-and-Global Learning (NLGL) to effectively estimate a video's origin and influence. Without loss of generality, we conduct experiments on an annotated dataset collected from a major video sharing site to evaluate the effectiveness of the framework. Sur...
Liu, J, Zhang, P, Yu, T, Yang, Y & Qiu, H 2014, '[Effects of losartan on pulmonary dendritic cells in lipopolysaccharide- induced acute lung injury mice].', Zhonghua yi xue za zhi, vol. 94, no. 41, pp. 3216-3219.
OBJECTIVE: To assess the effects of losartan on the frequency and phenotype of respiratory dendritic cells (DC) in lipopolysaccharide (LPS)-induced acute lung injury (ALI) mice. METHODS: The C57BL/6 mice were randomly divided into 3 groups of control, ALI and ALI+losartan. ALI animals received 2 mg/kg of LPS; ALI+losartan animals 2 mg/kg of LPS and 15 mg/kg of losartan 30 min before an intratracheal injection of LPS; control animals phosphate buffer saline (PBS) instead of LPS. Lung wet weight/body weight (LW/BW) was recorded to assess lung injury. The pathological changes were examined under optical microscope. The frequency and phenotype of pulmonary DC were characterized by flow cytometry. Meanwhile, the levels of IL-6 in lung homogenates were assessed by enzyme-linked immunosorbent assay (ELISA). RESULTS: (1) The LPS-induced rise in LW/BW was partially prevented by a pretreatment of losartan. (2) Histologically, widespread alveolar wall thickening caused by edema, severe hemorrhage in interstitium and alveolus and marked and diffuse interstitial infiltration of inflammatory cells were observed in the ALI group. Whereas, losartan effectively attenuated the LPS-induced pulmonary hemorrhage, leukocytic infiltration in interstitium and alveolus. (3) Meanwhile, the levels of IL-6 in lung tissue were significantly enhanced in the LPS-induced ALI mice. Yet after a pretreatment of losartan, the pulmonary level of IL-6 markedly decreased. (4) LPS dosing resulted in a rapid accumulation of DC in lung tissues and an up-regulated expression of CD80 in LPS-induced ALI. In contrast, the expression of MHC II on respiratory DC was not significantly different among groups. A pretreatment of losartan led to a marked reduction in CD80 expression on pulmonary DC (P < 0.05 vs ALI). CONCLUSION: Losartan may down-regulate pulmonary injury by inhibiting the activation of pulmonary DC.
Ma, Z, Yang, Y, Nie, F, Sebe, N, Yan, S & Hauptmann, AG 2014, 'Harnessing lab knowledge for real-world action recognition', International Journal of Computer Vision, vol. 109, no. 1-2, pp. 60-73.
Much research on human action recognition has been oriented toward the performance gain on lab-collected datasets. Yet real-world videos are more diverse, with more complicated actions and often only a few of them are precisely labeled. Thus, recognizing actions from these videos is a tough mission. The paucity of labeled real-world videos motivates us to "borrow" strength from other resources. Specifically, considering that many lab datasets are available, we propose to harness lab datasets to facilitate the action recognition in real-world videos given that the lab and real-world datasets are related. As their action categories are usually inconsistent, we design a multi-task learning framework to jointly optimize the classifiers for both sides. The general Schatten $p$-norm is exerted on the two classifiers to explore the shared knowledge between them. In this way, our framework is able to mine the shared knowledge between two datasets even if the two have different action categories, which is a major virtue of our method. The shared knowledge is further used to improve the action recognition in the real-world videos. Extensive experiments are performed on real-world datasets with promising results. © 2014 Springer Science+Business Media New York.
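For readers unfamiliar with the regularizer named in this abstract, the standard definition of the Schatten $p$-norm is reproduced below. The symbol $W$ and its dimensions are illustrative assumptions (for example, a matrix formed from the classifiers being regularized); the abstract does not state how the paper constructs this matrix.

```latex
% Standard definition; W, d and c are illustrative symbols, not taken from the paper.
\[
  \|W\|_{S_p} \;=\; \Bigl( \sum_{i=1}^{\min(d,\,c)} \sigma_i(W)^{p} \Bigr)^{1/p},
  \qquad W \in \mathbb{R}^{d \times c},
\]
% where \sigma_i(W) is the i-th singular value of W.
% p = 1 gives the nuclear (trace) norm; p = 2 gives the Frobenius norm.
```

Smaller values of $p$ penalize the number of non-negligible singular values more strongly, which is why such norms are commonly used to encourage a low-rank, shared structure.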
Ma, Z, Yang, Y, Sebe, N & Hauptmann, AG 2014, 'Knowledge adaptation with partially shared features for event detection using few exemplars', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 9, pp. 1789-1802.
Multimedia event detection (MED) is an emerging area of research. Previous work mainly focuses on simple event detection in sports and news videos, or abnormality detection in surveillance videos. In contrast, we focus on detecting more complicated and generic events that gain more users' interest, and we explore an effective solution for MED. Moreover, our solution only uses few positive examples since precisely labeled multimedia content is scarce in the real world. As the information from these few positive examples is limited, we propose using knowledge adaptation to facilitate event detection. Different from the state of the art, our algorithm is able to adapt knowledge from another source for MED even if the features of the source and the target are partially different, but overlapping. Avoiding the requirement that the two domains are consistent in feature types is desirable as data collection platforms change or augment their capabilities and we should be able to respond to this with little or no effort. We perform extensive experiments on real-world multimedia archives consisting of several challenging events. The results show that our approach outperforms several other state-of-the-art detection algorithms. © 1979-2012 IEEE.
Mu, Y, Yang, Y, Cao, L, Yan, S & Tian, Q 2014, 'Guest Editorial: Special issue on large scale multimedia semantic indexing', COMPUTER VISION AND IMAGE UNDERSTANDING, vol. 124, pp. 1-2.
Song, J, Yang, Y, Li, X, Huang, Z & Yang, Y 2014, 'Robust Hashing with Local Models for Approximate Similarity Search', IEEE TRANSACTIONS ON CYBERNETICS, vol. 44, no. 7, pp. 1225-1236.
Tong, W, Yang, Y, Jiang, L, Yu, SI, Lan, Z, Ma, Z, Sze, W, Younessian, E & Hauptmann, AG 2014, 'E-LAMP: Integration of innovative ideas for multimedia event detection', Machine Vision and Applications, vol. 25, no. 1, pp. 5-15.
Detecting multimedia events in web videos is an emerging hot research area in the fields of multimedia and computer vision. In this paper, we introduce the core methods and technologies of the framework we developed recently for our Event Labeling through Analytic Media Processing (E-LAMP) system to deal with different aspects of the overall problem of event detection. More specifically, we have developed efficient methods for feature extraction so that we are able to handle large collections of video data with thousands of hours of videos. Second, we represent the extracted raw features in a spatial bag-of-words model with more effective tilings such that the spatial layout information of different features and different events can be better captured, thus the overall detection performance can be improved. Third, different from widely used early and late fusion schemes, a novel algorithm is developed to learn a more robust and discriminative intermediate feature representation from multiple features so that better event models can be built upon it. Finally, to tackle the additional challenge of event detection with only very few positive exemplars, we have developed a novel algorithm which is able to effectively adapt the knowledge learnt from auxiliary sources to assist the event detection. Both our empirical results and the official evaluation results on TRECVID MED'11 and MED'12 demonstrate the excellent performance of the integration of these ideas. © 2013 Springer-Verlag Berlin Heidelberg.
Wang, S, Ma, Z, Yang, Y, Li, X, Pang, C & Hauptmann, AG 2014, 'Semi-supervised multiple feature analysis for action recognition', IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 289-298.
This paper presents a semi-supervised method for categorizing human actions using multiple visual features. The proposed algorithm simultaneously learns multiple features from a small number of labeled videos, and automatically utilizes data distributions between labeled and unlabeled data to boost the recognition performance. Shared structural analysis is applied in our approach to discover a common subspace shared by each type of feature. In the subspace, the proposed algorithm is able to characterize more discriminative information of each feature type. Additionally, data distribution information of each type of feature has been preserved. The aforementioned attributes make our algorithm robust for action recognition, especially when only limited labeled training samples are provided. Extensive experiments have been conducted on both the choreographed and the realistic video datasets, including KTH, Youtube action and UCF50. Experimental results show that our method outperforms several state-of-the-art algorithms. Most notably, much better performances have been achieved when there are only a few labeled training samples. © 1999-2012 IEEE.
Learning hash functions across heterogeneous high-dimensional features is very desirable for many applications involving multi-modal data objects. In this paper, we propose an approach to obtain the sparse codesets for the data objects across different modalities via joint multi-modal dictionary learning, which we call sparse multi-modal hashing (abbreviated as SM²). In SM², both intra-modality similarity and inter-modality similarity are first modeled by a hypergraph, then multi-modal dictionaries are jointly learned by Hypergraph Laplacian sparse coding. Based on the learned dictionaries, the sparse codeset of each data object is acquired and used for multi-modal approximate nearest neighbor retrieval using a sensitive Jaccard metric. The experimental results show that SM² outperforms other methods in terms of mAP and Percentage on two real-world data sets. © 2013 IEEE.
Yang, Y, Sebe, N, Snoek, C, Hua, X-S & Zhuang, Y 2014, 'Special section on learning from multiple evidences for large scale multimedia analysis', COMPUTER VISION AND IMAGE UNDERSTANDING, vol. 118, pp. 1-1.
Zhang, L, Song, M, Yang, Y, Zhao, Q, Zhao, C & Sebe, N 2014, 'Weakly supervised photo cropping', IEEE Transactions on Multimedia, vol. 16, no. 1, pp. 94-107.
Photo cropping is widely used in the printing industry, photography, and cinematography. Conventional photo cropping methods suffer from three drawbacks: 1) the semantics used to describe photo aesthetics are determined by the experience of model designers and specific data sets, 2) image global configurations, an essential cue to capture photo aesthetics, are not well preserved in the cropped photo, and 3) multi-channel visual features from an image region contribute differently to human aesthetics, but state-of-the-art photo cropping methods cannot automatically weight them. Owing to the recent progress in the image retrieval community, image-level semantics, i.e., photo labels obtained without much human supervision, can be efficiently and effectively acquired. Thus, we propose weakly supervised photo cropping, where a manifold embedding algorithm is developed to incorporate image-level semantics and image global configurations with graphlets, i.e., small-sized connected subgraphs. After manifold embedding, a Bayesian Network (BN) is proposed. It incorporates the testing photo into the framework derived from the multi-channel post-embedding graphlets of the training data, the importance of which is determined automatically. Based on the BN, photo cropping can be cast as searching for the candidate cropped photo that maximally preserves graphlets from the training photos, and the optimal cropping parameter is inferred by Gibbs sampling. Subjective evaluations demonstrate that: 1) our approach outperforms several representative photo cropping methods, including our previous cropping model that is guided by semantics-free graphlets, and 2) the visualized graphlets explicitly capture photo semantics and global spatial configurations. © 1999-2012 IEEE.
Zhang, L, Yang, Y, Gao, Y, Yu, Y, Wang, C & Li, X 2014, 'A probabilistic associative model for segmenting weakly supervised images', IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 4150-4159.
Weakly supervised image segmentation is an important yet challenging task in image processing and pattern recognition fields. It is defined as follows: in the training stage, semantic labels are only at the image-level, without regard to their specific object/scene location within the image. Given a test image, the goal is to predict the semantics of every pixel/superpixel. In this paper, we propose a new weakly supervised image segmentation model, focusing on learning the semantic associations between superpixel sets (graphlets in this paper). In particular, we first extract graphlets from each image, where a graphlet is a small-sized graph that measures the potential of multiple spatially neighboring superpixels (i.e., the probability of these superpixels sharing a common semantic label, such as the sky or the sea). To compare different-sized graphlets and to incorporate image-level labels, a manifold embedding algorithm is designed to transform all graphlets into equal-length feature vectors. Finally, we present a hierarchical Bayesian network to capture the semantic associations between postembedding graphlets, based on which the semantics of each superpixel is inferred accordingly. Experimental results demonstrate that: 1) our approach performs competitively compared with the state-of-the-art approaches on three public data sets and 2) considerable performance enhancement is achieved when using our approach on segmentation-based photo cropping and image categorization. © 2014 IEEE.
Cao, X, Wei, X, Han, Y, Yang, Y, Sebe, N & Hauptmann, A 2013, 'Unified Dictionary Learning and Region Tagging with Hierarchical Sparse Representation', COMPUTER VISION AND IMAGE UNDERSTANDING, vol. 117, no. 8, pp. 934-946.
Gao, C, Meng, D, Yang, Y, Wang, Y, Zhou, X & Hauptmann, AG 2013, 'Infrared Patch-Image Model for Small Target Detection in a Single Image', IEEE TRANSACTIONS ON IMAGE PROCESSING, vol. 22, no. 12, pp. 4996-5009.
Liang, Z, Zhuang, Y, Yang, Y & Xiao, J 2013, 'Retrieval-based cartoon gesture recognition and applications via semi-supervised heterogeneous classifiers learning', PATTERN RECOGNITION, vol. 46, no. 1, pp. 412-423.
Ma, Z, Yang, Y, Sebe, N, Zheng, K & Hauptmann, AG 2013, 'Multimedia Event Detection Using A Classifier-Specific Intermediate Representation', IEEE TRANSACTIONS ON MULTIMEDIA, vol. 15, no. 7, pp. 1628-1637.
Song, J, Yang, Y, Huang, Z, Shen, HT & Luo, J 2013, 'Effective Multiple Feature Hashing for Large-Scale Near-Duplicate Video Retrieval', IEEE TRANSACTIONS ON MULTIMEDIA, vol. 15, no. 8, pp. 1997-2008.
Yang, Y, Huang, Z, Yang, Y, Liu, J, Shen, HT & Luo, J 2013, 'Local image tagging via graph regularized joint group sparsity', PATTERN RECOGNITION, vol. 46, no. 5, pp. 1358-1368.
Yang, Y, Ma, Z, Hauptmann, AG & Sebe, N 2013, 'Feature Selection for Multimedia Analysis by Sharing Information Among Multiple Tasks', IEEE TRANSACTIONS ON MULTIMEDIA, vol. 15, no. 3, pp. 661-669.
Yang, Y, Song, J, Huang, Z, Ma, Z, Sebe, N & Hauptmann, AG 2013, 'Multi-Feature Fusion via Hierarchical Regression for Multimedia Analysis', IEEE TRANSACTIONS ON MULTIMEDIA, vol. 15, no. 3, pp. 572-581.
Yang, Y, Yang, Y & Shen, HT 2013, 'Effective Transfer Tagging from Image to Video', ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, vol. 9, no. 2.
Yang, Y, Yang, Y, Shen, HT, Zhang, Y, Du, X & Zhou, X 2013, 'Discriminative Nonnegative Spectral Clustering with Out-of-Sample Extension', IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, vol. 25, no. 8, pp. 1760-1771.
Zhang, L, Han, Y, Yang, Y, Song, M, Yan, S & Tian, Q 2013, 'Discovering Discriminative Graphlets for Aerial Image Categories Recognition', IEEE TRANSACTIONS ON IMAGE PROCESSING, vol. 22, no. 12, pp. 5071-5084.
Feng, Y, Xiao, J, Zha, Z, Zhang, H & Yang, Y 2012, 'Active learning for social image retrieval using Locally Regressive Optimal Design', NEUROCOMPUTING, vol. 95, pp. 54-59.
Liu, Y, Wu, F, Yang, Y, Zhuang, Y & Hauptmann, AG 2012, 'Spline Regression Hashing for Fast Image Search', IEEE TRANSACTIONS ON IMAGE PROCESSING, vol. 21, no. 10, pp. 4480-4491.
Ma, Z, Nie, F, Yang, Y, Uijlings, JRR & Sebe, N 2012, 'Web Image Annotation Via Subspace-Sparsity Collaborated Feature Selection', IEEE TRANSACTIONS ON MULTIMEDIA, vol. 14, no. 4, pp. 1021-1030.
Ma, Z, Nie, F, Yang, Y, Uijlings, JRR, Sebe, N & Hauptmann, AG 2012, 'Discriminating Joint Feature Analysis for Multimedia Data Understanding', IEEE TRANSACTIONS ON MULTIMEDIA, vol. 14, no. 6, pp. 1662-1672.
Yang, Y, Nie, F, Xu, D, Luo, J, Zhuang, Y & Pan, Y 2012, 'A Multimedia Retrieval Framework Based on Semi-Supervised Ranking and Relevance Feedback', IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 34, no. 4, pp. 723-742.
Yang, Y, Wu, F, Nie, F, Shen, HT, Zhuang, Y & Hauptmann, AG 2012, 'Web and Personal Image Annotation by Mining Label Correlation With Relaxed Visual Graph Embedding', IEEE TRANSACTIONS ON IMAGE PROCESSING, vol. 21, no. 3, pp. 1339-1351.
Zha, Z-J, Wang, M, Zheng, Y-T, Yang, Y, Hong, R & Chua, T-S 2012, 'Interactive Video Indexing With Statistical Active Learning', IEEE TRANSACTIONS ON MULTIMEDIA, vol. 14, no. 1, pp. 17-27.
Chen, C, Yang, Y, Nie, F & Odobez, J-M 2011, '3D human pose recovery from image by efficient visual feature selection', COMPUTER VISION AND IMAGE UNDERSTANDING, vol. 115, no. 3, pp. 290-299.
Chen, C, Zhuang, Y, Nie, F, Yang, Y, Wu, F & Xiao, J 2011, 'Learning a 3D Human Pose Distance Metric from Geometric Pose Descriptor', IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, vol. 17, no. 11, pp. 1676-1689.
Pan, H & Yang, Y 2010, 'Combining location and feature information for multimedia retrieval', International Journal of Computer Applications in Technology, vol. 38, no. 1-3, pp. 27-33.
In this paper, we propose a cross-media retrieval method for heterogeneous multimedia data by which the query examples and the returned results can be of different modalities, e.g., to query images by an example of audio clip. Taking multimedia location and content information into consideration, an affinity propagation based clustering approach is proposed to analyse and fuse the information carried by the co-existing multimedia objects so as to learn the semantic correlations among the heterogeneous multimedia data and perform cross-media retrieval. We also propose active learning methods of Relevance Feedback to make the search model more accurate. Copyright © 2010 Inderscience Enterprises Ltd.
Wu, F, Wang, W, Yang, Y, Zhuang, Y & Nie, F 2010, 'Classification by semi-supervised discriminative regularization', NEUROCOMPUTING, vol. 73, no. 10-12, pp. 1641-1651.
Yang, Y, Wu, F, Xu, D, Zhuang, Y & Chia, L-T 2010, 'Cross-media retrieval using query dependent search methods', PATTERN RECOGNITION, vol. 43, no. 8, pp. 2927-2936.
Yang, Y, Xu, D, Nie, F, Yan, S & Zhuang, Y 2010, 'Image Clustering Using Local Discriminant Models and Global Integration', IEEE TRANSACTIONS ON IMAGE PROCESSING, vol. 19, no. 10, pp. 2761-2773.
Yang, Y, Zhuang, Y, Tao, D, Xu, D, Yu, J & Luo, J 2010, 'Recognizing Cartoon Image Gestures for Retrieval and Interactive Cartoon Clip Synthesis', IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 12, pp. 1745-1756.
In this paper, we propose a new method to recognize gestures of cartoon images with two practical applications, i.e., content-based cartoon image retrieval and interactive cartoon clip synthesis. Upon analyzing the unique properties of four types of features including global color histogram, local color histogram (LCH), edge feature (EF), and motion direction feature (MDF), we propose to employ different features for different purposes and in various phases. We use EF to define a graph and then refine its local structure by LCH. Based on this graph, we adopt a transductive learning algorithm to construct local patches for each cartoon image. A spectral method is then proposed to optimize the local structure of each patch and then align these patches globally. MDF is fused with EF and LCH and a cartoon gesture space is constructed for cartoon image gesture recognition. We apply the proposed method to content-based cartoon image retrieval and interactive cartoon clip synthesis. The experiments demonstrate the effectiveness of our method.
Yang, Y, Guo, T, Zhuang, Y & Wang, W 2009, 'Cross-media retrieval based on synthesis reasoning model', Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, vol. 21, no. 9, pp. 1307-1314.
To gain better cross-media retrieval performance, it is crucial to mine the semantic correlations among the heterogeneous multimedia data. In this paper, we adopt the synthesis reasoning model as the underlying mechanism for mining the multimedia semantics for cross-media retrieval. We construct the synthesis reasoning sources according to the multimedia object low-level features and the reasoning source intensity field according to the multimedia co-existence information. A series of multimedia semantic spaces are built by spectral method after synthesis reasoning. The cross-media retrieval is performed on a per-query basis by which different retrieval methods are adopted for different queries. Both short-term and long-term relevance feedback are learned to introduce new multimedia objects that were not in the training set into the multimedia semantic spaces, and to refine the reasoning result. Experimental results show that the proposed methods can be used to accurately mine the multimedia semantics and the approach of cross-media retrieval is accurate and stable.
Yang, Y, Zhuang, Y-T, Wu, F & Pan, Y-H 2008, 'Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval', IEEE TRANSACTIONS ON MULTIMEDIA, vol. 10, no. 3, pp. 437-446.
Zhuang, YT, Yang, Y & Wu, F 2008, 'Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval', IEEE Transactions on Multimedia, vol. 10, no. 2, pp. 221-229.
Although multimedia objects such as images, audios and texts are of different modalities, there are a great many semantic correlations among them. In this paper, we propose a method of transductive learning to mine the semantic correlations among media objects of different modalities so as to achieve cross-media retrieval. Cross-media retrieval is a new kind of searching technology by which the query examples and the returned results can be of different modalities, e.g., to query images by an example of audio. First, according to the media objects' features and their co-existence information, we construct a uniform cross-media correlation graph, in which media objects of different modalities are represented uniformly. To perform the cross-media retrieval, a positive score is assigned to the query example; the score spreads along the graph and media objects of target modality or MMDs with the highest scores are returned. To boost the retrieval performance, we also propose different approaches of long-term and short-term relevance feedback to mine the information contained in the positive and negative examples. © 2008 IEEE.
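The score-spreading step described in this abstract can be illustrated with a generic graph propagation sketch. This is not the authors' algorithm: the update rule, the damping value `alpha`, the function names and the toy graph are all assumptions chosen for illustration, and the abstract does not specify how edge weights are built or how relevance feedback is applied.

```python
def spread_scores(weights, query, alpha=0.8, iterations=50):
    """Propagate a positive query score over a weighted graph.

    weights: symmetric adjacency matrix as a list of lists (hypothetical input).
    query: index of the query node, which receives the initial positive score.
    alpha: share of each node's score drawn from its neighbours (assumed value).
    """
    n = len(weights)
    # Row-normalise so that each node takes a weighted average of its neighbours.
    norm = []
    for row in weights:
        total = sum(row)
        norm.append([w / total if total > 0 else 0.0 for w in row])
    initial = [1.0 if i == query else 0.0 for i in range(n)]
    scores = initial[:]
    for _ in range(iterations):
        # Mix the neighbours' scores with the initial query score.
        scores = [
            alpha * sum(norm[i][j] * scores[j] for j in range(n))
            + (1 - alpha) * initial[i]
            for i in range(n)
        ]
    return scores


def rank_targets(scores, target_nodes):
    """Return the target-modality nodes ordered by descending score."""
    return sorted(target_nodes, key=lambda i: scores[i], reverse=True)


# Toy graph (made up): node 0 is an audio query, nodes 1-3 are images.
weights = [
    [0.0, 1.0, 0.2, 0.0],
    [1.0, 0.0, 0.5, 0.0],
    [0.2, 0.5, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
scores = spread_scores(weights, query=0)
ranking = rank_targets(scores, target_nodes=[1, 2, 3])
print(ranking)
```

In this toy graph the image most strongly linked to the audio query is ranked first, and the image reachable only through another image is ranked last.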
Cai, Y, Yang, Y, Hauptmann, A & Wactlar, H 2015, 'Monitoring and coaching the use of home medical devices' in Briassouli, A, Benois-Pineau, J & Hauptmann, A (eds), Health Monitoring and Personalized Feedback using Multimedia Data, Springer, Germany, pp. 265-283.
© Springer International Publishing Switzerland 2015. Despite the popularity of home medical devices, serious safety concerns have been raised, because use errors of home medical devices have been linked to a large number of fatal hazards. To resolve the problem, we introduce a cognitive assistive system to automatically monitor the use of home medical devices. Being able to accurately recognize user operations is one of the most important functionalities of the proposed system. However, even though various action recognition algorithms have been proposed in recent years, it is still unknown whether they are adequate for recognizing operations in using home medical devices. Since the lack of a corresponding database is the main reason for this situation, in the first part of this paper we present a database specially designed for studying the use of home medical devices. Then, we evaluate the performance of the existing approaches on the proposed database. Although state-of-the-art approaches have demonstrated near-perfect performance in recognizing certain general human actions, we observe a significant performance drop when applying them to recognize device operations. We conclude that the tiny actions involved in using devices are one of the most important reasons for the performance decrease. To accurately recognize tiny actions, it is critical to focus on where the target action happens, namely the region of interest (ROI), and to have more elaborate action modeling based on the ROI. Therefore, in the second part of this paper, we introduce a simple but effective approach to estimating the ROI for recognizing tiny actions. The key idea of this method is to analyze the correlation between an action and the sub-regions of a frame. The estimated ROI is then used as a filter for building more accurate action representations. Experimental results show significant performance improvements over the baseline methods by using the estimated ROI for action recogn...
Chang, X, Huang, PY, Shen, YD, Liang, X, Yang, Y & Hauptmann, AG 2018, 'RCAA: Relational context-aware agents for person search', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), European Conference on Computer Vision, Springer, Munich, Germany, pp. 86-102.
© Springer Nature Switzerland AG 2018. We aim to search for a target person from a gallery of whole scene images for which the annotations of pedestrian bounding boxes are unavailable. Previous approaches to this problem have relied on a pedestrian proposal net, which may generate redundant proposals and increase the computational burden. In this paper, we address this problem by training relational context-aware agents which learn the actions to localize the target person from the gallery of whole scene images. We incorporate the relational spatial and temporal contexts into the framework. Specifically, we propose to use the target person as the query in the query-dependent relational network. The agent determines the best action to take at each time step by simultaneously considering the local visual information, the relational and temporal contexts, together with the target person. To validate the performance of our approach, we conduct extensive experiments on the large-scale Person Search benchmark dataset and achieve significant improvements over the compared approaches. It is also worth noting that the proposed model even performs better than traditional methods with perfect pedestrian detectors.
Deng, W, Zheng, L, Ye, Q, Kang, G, Yang, Y & Jiao, J 2018, 'Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification', 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, pp. 994-1003.
Dong, X, Yan, Y, Ouyang, W & Yang, Y 2018, 'Style Aggregated Network for Facial Landmark Detection', 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, USA.
Dong, X, Yu, SI, Weng, X, Wei, SE, Yang, Y & Sheikh, Y 2018, 'Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, USA, pp. 360-368.
© 2018 IEEE. In this paper, we present supervision-by-registration, an unsupervised approach to improve the precision of facial landmark detectors on both images and video. Our key observation is that the detections of the same landmark in adjacent frames should be coherent with registration, i.e., optical flow. Interestingly, coherency of optical flow is a source of supervision that does not require manual labeling, and can be leveraged during detector training. For example, we can enforce in the training loss function that a detected landmark at frame t-1 followed by optical flow tracking from frame t-1 to frame t should coincide with the location of the detection at frame t. Essentially, supervision-by-registration augments the training loss function with a registration loss, thus training the detector to have output that is not only close to the annotations in labeled images, but also consistent with registration on large amounts of unlabeled videos. End-to-end training with the registration loss is made possible by a differentiable Lucas-Kanade operation, which computes optical flow registration in the forward pass, and back-propagates gradients that encourage temporal coherency in the detector. The output of our method is a more precise image-based facial landmark detector, which can be applied to single images or video. With supervision-by-registration, we demonstrate (1) improvements in facial landmark detection on both images (300W, AFLW) and video (300VW, Youtube-Celebrities), and (2) significant reduction of jittering in video detections.
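The coherence constraint in this abstract (detection at frame t-1, moved by optical flow, should land on the detection at frame t) can be written as a small loss function. The sketch below is a minimal illustration that takes the flow displacement as a given input and uses a squared Euclidean distance; both choices are assumptions, since the paper computes the registration with a differentiable Lucas-Kanade operation whose details are not given here.

```python
def registration_loss(det_prev, det_curr, flow_at_prev):
    """Squared distance between a flow-tracked landmark and the new detection.

    det_prev:     (x, y) detection of a landmark at frame t-1.
    det_curr:     (x, y) detection of the same landmark at frame t.
    flow_at_prev: (dx, dy) optical-flow displacement at det_prev
                  (assumed to be precomputed; hypothetical input).
    """
    tracked_x = det_prev[0] + flow_at_prev[0]
    tracked_y = det_prev[1] + flow_at_prev[1]
    # Squared Euclidean distance is an assumed choice of penalty.
    return (tracked_x - det_curr[0]) ** 2 + (tracked_y - det_curr[1]) ** 2


# Detection at frame t agrees with the tracked position: no penalty.
coherent = registration_loss((10.0, 20.0), (12.0, 21.0), (2.0, 1.0))
# Detection at frame t is one pixel off in x: the loss penalises the jitter.
jittery = registration_loss((10.0, 20.0), (13.0, 21.0), (2.0, 1.0))
```

Because the loss needs only two detections and a flow field, it can be evaluated on unlabeled video, which is the source of supervision the abstract refers to.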
Dong, X, Zhu, L, Zhang, D, Yang, Y & Wu, F 2018, 'Fast parameter adaptation for few-shot image captioning and visual question answering', MM 2018 - Proceedings of the 2018 ACM Multimedia Conference, pp. 54-62.
© 2018 Association for Computing Machinery. Given only a few image-text pairs, humans can learn to detect semantic concepts and describe the content. Machine learning algorithms, by contrast, usually require a lot of data to train a deep neural network to solve the problem, and it is challenging for existing systems to generalize well to the few-shot multi-modal scenario, because the learner should understand not only images and texts but also their relationships from only a few examples. In this paper, we tackle two multi-modal problems, i.e., image captioning and visual question answering (VQA), in the few-shot setting. We propose Fast Parameter Adaptation for Image-Text Modeling (FPAIT), which learns to learn a joint understanding of image and text data from a few examples. In practice, FPAIT has two benefits. (1) Fast learning ability. FPAIT learns proper initial parameters for the joint image-text learner from a large number of different tasks. When a new task comes, FPAIT can use a small number of gradient steps to achieve good performance. (2) Robustness to few examples. In few-shot tasks, the small training set introduces large biases in Convolutional Neural Networks (CNNs) and damages the learner's performance. FPAIT leverages dynamic linear transformations to alleviate the side effects of the small training set. In this way, FPAIT flexibly normalizes the features and thus reduces the biases during training. Quantitatively, FPAIT achieves superior performance on both few-shot image captioning and VQA benchmarks.
Fan, H, Xu, Z, Zhu, L, Yan, C, Ge, J & Yang, Y 2018, 'Watching a small portion could be as good as watching all: Towards efficient video classification', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pp. 705-711.
© 2018 International Joint Conferences on Artificial Intelligence. All rights reserved. We aim to significantly reduce the computational cost for classification of temporally untrimmed videos while retaining similar accuracy. Existing video classification methods sample frames with a predefined frequency over the entire video. In contrast, we propose an end-to-end deep reinforcement learning approach which enables an agent to classify videos by watching a very small portion of frames, as humans do. We make two main contributions. First, information is not equally distributed in video frames along time. An agent needs to watch more carefully when a clip is informative and skip the frames if they are redundant or irrelevant. The proposed approach enables the agent to adapt the sampling rate to the video content and skip most of the frames without loss of information. Second, in order to reach a confident decision, the number of frames that should be watched by an agent varies greatly from one video to another. We incorporate an adaptive stop network to measure a confidence score and generate a timely trigger to stop the agent watching videos, which improves efficiency without loss of accuracy. Our approach reduces the computational cost significantly for the large-scale YouTube-8M dataset, while the accuracy remains the same.
He, Y, Kang, G, Dong, X, Fu, Y & Yang, Y 2018, 'Soft filter pruning for accelerating deep convolutional neural networks', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pp. 2234-2240.
© 2018 International Joint Conferences on Artificial Intelligence. All rights reserved. This paper proposes a Soft Filter Pruning (SFP) method to accelerate the inference procedure of deep Convolutional Neural Networks (CNNs). Specifically, the proposed SFP enables the pruned filters to be updated when training the model after pruning. SFP has two advantages over previous works: (1) Larger model capacity. Updating previously pruned filters provides our approach with a larger optimization space than fixing the filters to zero. Therefore, the network trained by our method has a larger model capacity to learn from the training data. (2) Less dependence on the pre-trained model. Large capacity enables SFP to train from scratch and prune the model simultaneously. In contrast, previous filter pruning methods must be conducted on the basis of a pre-trained model to guarantee their performance. Empirically, SFP from scratch outperforms the previous filter pruning methods. Moreover, our approach has been demonstrated to be effective for many advanced CNN architectures. Notably, on ILSVRC-2012, SFP reduces more than 42% of the FLOPs on ResNet-101 with even a 0.2% top-5 accuracy improvement, which advances the state of the art.
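The difference between soft pruning and fixing pruned filters to zero can be shown with a toy layer. In the sketch below, filters are flat lists of weights, the filters with the smallest L2 norm are selected for pruning, and a plain SGD step is applied afterwards; the selection criterion, the function names and the update rule are illustrative assumptions rather than details taken from this abstract.

```python
import math


def soft_prune(filters, prune_ratio):
    """Zero the weakest filters in place but keep them in the layer.

    The L2-norm criterion is an assumed selection rule for this sketch.
    Returns the indices of the filters that were zeroed.
    """
    norms = [math.sqrt(sum(w * w for w in f)) for f in filters]
    num_pruned = int(len(filters) * prune_ratio)
    weakest = sorted(range(len(filters)), key=lambda i: norms[i])[:num_pruned]
    for i in weakest:
        filters[i] = [0.0] * len(filters[i])
    return weakest


def sgd_step(filters, grads, lr):
    """Plain SGD update applied to every filter, including zeroed ones."""
    for f, g in zip(filters, grads):
        for j in range(len(f)):
            f[j] -= lr * g[j]


# Four toy filters with two weights each (hypothetical values).
layer = [[3.0, 4.0], [0.1, 0.2], [1.0, 1.0], [0.0, 0.5]]
pruned = soft_prune(layer, prune_ratio=0.5)
zeroed_after_prune = [list(layer[i]) for i in pruned]

# Soft pruning: the zeroed filters still receive gradient updates,
# so they can recover before the next pruning round.
sgd_step(layer, grads=[[-1.0, -1.0]] * 4, lr=0.1)
```

A hard-pruning variant would skip the update for the indices in `pruned`, which is the "fixing the filters to zero" baseline the abstract contrasts against.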
Kang, G, Zheng, L, Yan, Y & Yang, Y 2018, 'Deep Adversarial Attention Alignment for Unsupervised Domain Adaptation: The Benefit of Target Expectation Maximization', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), European Conference on Computer Vision, Springer Link, Munich, Germany, pp. 420-436.
© 2018, Springer Nature Switzerland AG. In this paper, we make two contributions to unsupervised domain adaptation (UDA) using the convolutional neural network (CNN). First, our approach transfers knowledge in all the convolutional layers through attention alignment. Most previous methods align high-level representations, e.g., activations of the fully connected (FC) layers. In these methods, however, the convolutional layers which underpin critical low-level domain knowledge cannot be updated directly towards reducing domain discrepancy. Specifically, we assume that the discriminative regions in an image are relatively invariant to image style changes. Based on this assumption, we propose an attention alignment scheme on all the target convolutional layers to uncover the knowledge shared by the source domain. Second, we estimate the posterior label distribution of the unlabeled data for target network training. Previous methods, which iteratively update the pseudo labels by the target network and refine the target network by the updated pseudo labels, are vulnerable to label estimation errors. Instead, our approach uses category distribution to calculate the cross-entropy loss for training, thereby ameliorating the error accumulation of the estimated labels. The two contributions allow our approach to outperform the state-of-the-art methods by +2.6% on the Office-31 dataset.
Li, Z, Nie, F, Chang, X, Ma, Z & Yang, Y 2018, 'Balanced clustering via exclusive lasso: A pragmatic approach', 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, AAAI Conference on Artificial Intelligence, AAAI, New Orleans, USA, pp. 3596-3603.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Clustering is an effective technique in data mining for generating groups of interest. Among various clustering approaches, the families of k-means algorithms and min-cut algorithms have gained the most popularity due to their simplicity and efficacy. The classical k-means algorithm partitions a number of data points into several subsets by iteratively updating the clustering centers and the associated data points. By contrast, min-cut algorithms construct a weighted undirected graph and partition the vertices of the graph into two sets. However, existing clustering algorithms tend to assign only a small minority of the data points to some subsets, which should be avoided when the target dataset is balanced. To achieve more accurate clustering for balanced datasets, we propose to apply the exclusive lasso to k-means and min-cut to regulate the balance degree of the clustering results. By optimizing our objective functions, which build on the exclusive lasso, we can make the clustering result as balanced as possible. Extensive experiments on several large-scale datasets validate the advantage of the proposed algorithms compared to the state-of-the-art clustering algorithms.
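Why an exclusive-lasso term favours balanced clusters can be seen from a small calculation. The sketch below assumes the commonly used form of the exclusive lasso on a hard cluster-indicator matrix, which reduces to the sum of squared cluster sizes; the function name and the toy assignments are hypothetical, and the abstract does not state the exact objective used in the paper.

```python
def exclusive_lasso(assignments, num_clusters):
    """Sum of squared cluster sizes for a hard clustering.

    With a 0/1 indicator matrix, summing the absolute entries of each
    cluster column and squaring gives (cluster size)^2 (assumed form).
    """
    sizes = [0] * num_clusters
    for cluster in assignments:
        sizes[cluster] += 1
    return sum(size * size for size in sizes)


# Six points, three clusters: a balanced and a skewed assignment.
balanced_penalty = exclusive_lasso([0, 0, 1, 1, 2, 2], num_clusters=3)
skewed_penalty = exclusive_lasso([0, 0, 0, 0, 1, 2], num_clusters=3)
```

For a fixed number of points the penalty is smallest when all clusters have equal size, so adding it to a clustering objective pushes the result toward balance.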
Liu, W, Chang, X, Chen, L & Yang, Y 2018, 'Semi-supervised Bayesian attribute learning for person re-identification', Thirty-Second AAAI Conference on Artificial Intelligence, AAAI, New Orleans, Louisiana, USA.
Person re-identification (re-ID) tasks aim to identify the same person in multiple images captured from non-overlapping camera views. Most previous re-ID studies have attempted to solve this problem through either representation learning or metric learning, or by combining both techniques. Representation learning relies on the latent factors or attributes of the data. In most of these works, the dimensionality of the factors/attributes has to be manually determined for each new dataset. Thus, this approach is not robust. Metric learning optimizes a metric across the dataset to measure similarity according to distance. However, choosing the optimal method for computing these distances is data dependent, and learning the appropriate metric relies on a sufficient number of pair-wise labels. To overcome these limitations, we propose a novel algorithm for person re-ID, called semi-supervised Bayesian attribute learning. We introduce an Indian Buffet Process to identify the priors of the latent attributes. The dimensionality of attributes factors is then automatically determined by nonparametric Bayesian learning. Meanwhile, unlike traditional distance metric learning, we propose a re-identification probability distribution to describe how likely it is that a pair of images contains the same person. This technique relies solely on the latent attributes of both images. Moreover, pair-wise labels that are not known can be estimated from pair-wise labels that are known, making this a robust approach for semi-supervised learning. Extensive experiments demonstrate the superior performance of our algorithm over several state-of-the-art algorithms on small-scale datasets and comparable performance on large-scale re-ID datasets.
Luo, Y, Zheng, Z, Zheng, L, Guan, T, Yu, J & Yang, Y 2018, 'Macro-micro adversarial network for human parsing', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), European Conference on Computer Vision 2018, Munich, Germany, pp. 424-440.
© Springer Nature Switzerland AG 2018. In human parsing, the pixel-wise classification loss has drawbacks in its low-level local inconsistency and high-level semantic inconsistency. The introduction of the adversarial network tackles the two problems using a single discriminator. However, the two types of parsing inconsistency are generated by distinct mechanisms, so it is difficult for a single discriminator to solve them both. To address the two kinds of inconsistencies, this paper proposes the Macro-Micro Adversarial Net (MMAN). It has two discriminators. One discriminator, Macro D, acts on the low-resolution label map and penalizes semantic inconsistency, e.g., misplaced body parts. The other discriminator, Micro D, focuses on multiple patches of the high-resolution label map to address the local inconsistency, e.g., blur and hole. Compared with traditional adversarial networks, MMAN not only enforces local and semantic consistency explicitly, but also avoids the poor convergence problem of adversarial networks when handling high resolution images. In our experiment, we validate that the two discriminators are complementary to each other in improving the human parsing accuracy. The proposed framework is capable of producing competitive parsing performance compared with the state-of-the-art methods, i.e., mIoU = 46.81% and 59.91% on LIP and PASCAL-Person-Part, respectively. On a relatively small dataset PPSS, our pre-trained model demonstrates impressive generalization ability. The code is publicly available at https://github.com/RoyalVane/MMAN.
Sun, Y, Zheng, L, Yang, Y, Tian, Q & Wang, S 2018, 'Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline)', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), European Conference on Computer Vision, Springer, Munich, Germany, pp. 501-518.
© 2018, Springer Nature Switzerland AG. Employing part-level features offers fine-grained information for pedestrian image description. A prerequisite of part discovery is that each part should be well located. Instead of using external resources such as a pose estimator, we consider content consistency within each part for precise part location. Specifically, we aim to learn discriminative part-informed features for person retrieval and make two contributions. (i) A network named Part-based Convolutional Baseline (PCB). Given an image input, it outputs a convolutional descriptor consisting of several part-level features. With a uniform partition strategy, PCB achieves competitive results with the state-of-the-art methods, proving itself as a strong convolutional baseline for person retrieval. (ii) A refined part pooling (RPP) method. Uniform partition inevitably incurs outliers in each part, which are in fact more similar to other parts. RPP re-assigns these outliers to the parts they are closest to, resulting in refined parts with enhanced within-part consistency. Experiments confirm that RPP allows PCB to gain another round of performance boost. For instance, on the Market-1501 dataset, we achieve (77.4+4.2)% mAP and (92.3+1.5)% rank-1 accuracy, surpassing the state of the art by a large margin. Code is available at: https://github.com/syfafterzy/PCB_RPP.
Wang, H, Chang, X, Shi, L, Yang, Y & Shen, YD 2018, 'Uncertainty sampling for action recognition via maximizing expected average precision', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, ACM, Stockholm, Sweden, pp. 964-970.
© 2018 International Joint Conferences on Artificial Intelligence. All rights reserved. Recognizing human actions in video clips has been an important topic in computer vision. Sufficient labeled data is one of the prerequisites for the good performance of action recognition algorithms. However, while abundant videos can be collected from the Internet, categorizing each video clip is time-consuming. Active learning is one way to alleviate the labeling labor by allowing the classifier to choose the most informative unlabeled instances for manual annotation. Among various active learning algorithms, uncertainty sampling is arguably the most widely used strategy. Conventional uncertainty sampling strategies such as entropy-based methods are usually evaluated by accuracy. However, in action recognition Average Precision (AP) is an acknowledged evaluation metric, which has been largely ignored in the active learning community. It is defined as the area under the precision-recall curve. In this paper, we propose a novel uncertainty sampling algorithm for action recognition using expected AP. We conduct experiments on three real-world action recognition datasets and show that our algorithm outperforms other uncertainty-based active learning algorithms.
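The Average Precision metric mentioned in this abstract can be computed from a ranked list of predictions. The sketch below uses the common non-interpolated estimate of the area under the precision-recall curve (the mean of the precision values at the rank of each positive item); it illustrates the metric only, not the paper's expected-AP sampling algorithm, and the scores and labels are made up.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean precision at the rank of each positive.

    scores: classifier confidences, higher means more likely positive.
    labels: 1 for positive items, 0 for negative items.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


# Positives are ranked 1st and 3rd: precisions 1/1 and 2/3, so AP = 5/6.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

Unlike accuracy, this value depends on the whole ranking, which is why a sampling strategy tuned for accuracy need not be the best one for AP.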
Wu, Y, Lin, Y, Dong, X, Yan, Y, Ouyang, W & Yang, Y 2018, 'Exploit the Unknown Gradually: One-Shot Video-Based Person Re-Identification by Stepwise Learning', 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, pp. 5177-5186.
Wu, Y, Zhu, L, Jiang, L & Yang, Y 2018, 'Decoupled novel object captioner', MM 2018 - Proceedings of the 2018 ACM Multimedia Conference, ACM International conference on Multimedia, ACM DL, Seoul, Republic of Korea, pp. 1029-1037.
© 2018 Association for Computing Machinery. Image captioning is a challenging task where the machine automatically describes an image by sentences or phrases. It often requires a large number of paired image-sentence annotations for training. However, a pre-trained captioning model can hardly be applied to a new domain in which some novel object categories exist, i.e., the objects and their description words are unseen during model training. To correctly caption the novel object, it requires professional human workers to annotate the images by sentences with the novel words. It is labor expensive and thus limits its usage in real-world applications. In this paper, we introduce the zero-shot novel object captioning task where the machine generates descriptions without extra training sentences about the novel object. To tackle the challenging problem, we propose a Decoupled Novel Object Captioner (DNOC) framework that can fully decouple the language sequence model from the object descriptions. DNOC has two components. 1) A Sequence Model with the Placeholder (SM-P) generates a sentence containing placeholders. The placeholder represents an unseen novel object. Thus, the sequence model can be decoupled from the novel object descriptions. 2) A key-value object memory built upon the freely available detection model, contains the visual information and the corresponding word for each object. A query generated from the SM-P is used to retrieve the words from the object memory. The placeholder will further be filled with the correct word, resulting in a caption with novel object descriptions. The experimental results on the held-out MSCOCO dataset demonstrate the ability of DNOC in describing novel concepts.
Yan, Y, Yang, T, Li, Z, Lin, Q & Yang, Y 2018, 'A unified analysis of stochastic momentum methods for deep learning', Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, Stockholm, Sweden, pp. 2955-2961.
© 2018 International Joint Conferences on Artificial Intelligence. All rights reserved. Stochastic momentum methods have been widely adopted in training deep neural networks. However, the theoretical analysis of their convergence on the training objective and of their generalization error for prediction is still under-explored. This paper aims to bridge the gap between practice and theory by analyzing the stochastic gradient (SG) method and the stochastic momentum methods, including two famous variants, i.e., the stochastic heavy-ball (SHB) method and the stochastic variant of Nesterov's accelerated gradient (SNAG) method. We propose a framework that unifies the three variants. We then derive the convergence rates of the norm of the gradient for the non-convex optimization problem, and analyze the generalization performance through the uniform stability approach. In particular, the convergence analysis of the training objective shows that SHB and SNAG have no advantage over SG. However, the stability analysis shows that the momentum term can improve the stability of the learned model and hence improve the generalization performance. These theoretical insights verify the common wisdom and are also corroborated by our empirical analysis on deep learning.
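For readers unfamiliar with the three methods named in this abstract, the sketch below writes out their textbook update rules on a scalar parameter. The function names, step sizes and the toy gradient are assumptions for illustration; the unified framework proposed in the paper is not reproduced here.

```python
def sg(grad, x0, lr, steps):
    """Stochastic gradient: x <- x - lr * g(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


def shb(grad, x0, lr, beta, steps):
    """Heavy-ball: x <- x - lr * g(x) + beta * (x - x_prev)."""
    x_prev, x = x0, x0
    for _ in range(steps):
        x_next = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


def snag(grad, x0, lr, beta, steps):
    """Nesterov: y <- x - lr * g(x); x <- y + beta * (y - y_prev)."""
    x, y_prev = x0, x0
    for _ in range(steps):
        y = x - lr * grad(x)
        x = y + beta * (y - y_prev)
        y_prev = y
    return x


def toy_grad(x):
    # Gradient of f(x) = 0.5 * x^2; a noisy estimate would be used in practice.
    return x


x_sg = sg(toy_grad, x0=1.0, lr=0.1, steps=200)
x_shb = shb(toy_grad, x0=1.0, lr=0.1, beta=0.9, steps=200)
x_snag = snag(toy_grad, x0=1.0, lr=0.1, beta=0.9, steps=200)
```

Setting `beta=0` in either momentum variant recovers the plain SG iterate, which is one simple way to see that the three methods belong to a single family.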
Zhang, X, Wei, Y, Feng, J, Yang, Y & Huang, T 2018, 'Adversarial Complementary Learning for Weakly Supervised Object Localization', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, USA, pp. 1325-1334.
© 2018 IEEE. In this work, we propose Adversarial Complementary Learning (ACoL) to automatically localize integral objects of semantic interest with weak supervision. We first mathematically prove that class localization maps can be obtained by directly selecting the class-specific feature maps of the last convolutional layer, which paves a simple way to identify object regions. We then present a simple network architecture including two parallel-classifiers for object localization. Specifically, we leverage one classification branch to dynamically localize some discriminative object regions during the forward pass. Although it is usually responsive to sparse parts of the target objects, this classifier can drive the counterpart classifier to discover new and complementary object regions by erasing its discovered regions from the feature maps. With such an adversarial learning, the two parallel-classifiers are forced to leverage complementary object regions for classification and can finally generate integral object localization together. The merits of ACoL are mainly two-fold: 1) it can be trained in an end-to-end manner; 2) dynamically erasing enables the counterpart classifier to discover complementary object regions more effectively. We demonstrate the superiority of our ACoL approach in a variety of experiments. In particular, the Top-1 localization error rate on the ILSVRC dataset is 45.14%, which is the new state-of-the-art.
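The erasing step described in this abstract, where regions found by one classifier are removed from the feature maps seen by its counterpart, can be sketched on a single-channel map. The fixed threshold, the function name and the toy values below are assumptions for illustration; the abstract does not specify how discovered regions are selected.

```python
def erase_discovered(feature_map, localization_map, threshold):
    """Zero the features wherever the first branch responds strongly.

    feature_map:      2D list of feature values shared by both branches.
    localization_map: 2D list of the first classifier's responses.
    threshold:        responses above this count as discovered (assumed rule).
    """
    return [
        [0.0 if loc > threshold else feat for feat, loc in zip(f_row, l_row)]
        for f_row, l_row in zip(feature_map, localization_map)
    ]


features = [[0.5, 0.8], [0.3, 0.9]]
first_branch_response = [[0.1, 0.95], [0.2, 0.4]]
# The counterpart classifier sees the map with the top-right region erased,
# so it has to rely on other, complementary regions.
erased = erase_discovered(features, first_branch_response, threshold=0.6)
```

The input map is left unchanged, so the first branch can keep using the full features while only the second branch receives the erased copy.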
Zhang, X, Wei, Y, Kang, G, Yang, Y & Huang, T 2018, 'Self-produced guidance for weakly-supervised object localization', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), European Conference on Computer Vision, Springer, Munich, Germany, pp. 610-625.
© Springer Nature Switzerland AG 2018. Weakly supervised methods usually generate localization results based on attention maps produced by classification networks. However, the attention maps highlight only the most discriminative parts of the object, which are small and sparse. We propose to generate Self-produced Guidance (SPG) masks which separate the foreground, i.e., the object of interest, from the background to provide the classification networks with spatial correlation information of pixels. A stagewise approach is proposed to incorporate high-confidence object regions to learn the SPG masks. The high-confidence regions within attention maps are utilized to progressively learn the SPG masks. The masks are then used as auxiliary pixel-level supervision to facilitate the training of classification networks. Extensive experiments on ILSVRC demonstrate that SPG is effective in producing high-quality object localization maps. In particular, the proposed SPG achieves a Top-1 localization error rate of 43.83% on the ILSVRC validation set, which is a new state-of-the-art error rate.
Zheng, L, Zhao, Y, Wang, S, Wang, J, Yang, Y & Tian, Q 2018, 'On the large-scale transferability of convolutional neural networks', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Melbourne, VIC, Australia, pp. 27-39.
© Springer Nature Switzerland AG 2018. Given the overwhelming performance of the Convolutional Neural Network (CNN) in the computer vision and machine learning community, this paper aims at investigating the effective transfer of the CNN descriptors in generic and fine-grained classification at a large scale. Our contribution consists in providing some simple yet effective methods in constructing a competitive baseline recognition system. Comprehensively, we study two facts in CNN transfer. (1) We demonstrate the advantage of using images with a properly large size as input to CNN instead of the conventionally resized one. (2) We benchmark the performance of different CNN layers improved by average/max pooling on the feature maps. Our evaluation and observation confirm that the Conv5 descriptor yields very competitive accuracy under such a pooling strategy. Following these good practices, we are capable of producing improved performance on seven image classification benchmarks.
Zhong, Z, Zheng, L, Li, S & Yang, Y 2018, 'Generalizing a person retrieval model hetero- and homogeneously', Computer Vision – ECCV 2018 (LNCS 11217), European Conference on Computer Vision, Springer, Germany, pp. 176-192.
© Springer Nature Switzerland AG 2018. Person re-identification (re-ID) poses unique challenges for unsupervised domain adaptation (UDA) in that classes in the source and target sets (domains) are entirely different and that image variations are largely caused by cameras. Given a labeled source training set and an unlabeled target training set, we aim to improve the generalization ability of re-ID models on the target testing set. To this end, we introduce a Hetero-Homogeneous Learning (HHL) method. Our method enforces two properties simultaneously: (1) camera invariance, learned via positive pairs formed by unlabeled target images and their camera style transferred counterparts; (2) domain connectedness, by regarding source/target images as negative matching pairs to the target/source images. The first property is implemented by homogeneous learning because training pairs are collected from the same domain. The second property is achieved by heterogeneous learning because we sample training pairs from both the source and target domains. On Market-1501, DukeMTMC-reID and CUHK03, we show that the two properties contribute indispensably and that very competitive re-ID UDA accuracy is achieved. Code is available at: https://github.com/zhunzhong07/HHL.
Zhong, Z, Zheng, L, Zheng, Z, Li, S & Yang, Y 2018, 'Camera Style Adaptation for Person Re-identification', Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, USA.View/Download from: UTS OPUS or Publisher's site
Zhu, L & Yang, Y 2018, 'Compound memory networks for few-shot video classification', Computer Vision – ECCV 2018 (LNCS), European Conference on Computer Vision, Springer, Munich, Germany, pp. 782-797.View/Download from: UTS OPUS or Publisher's site
© Springer Nature Switzerland AG 2018. In this paper, we propose a new memory network structure for few-shot video classification by making the following contributions. First, we propose a compound memory network (CMN) structure under the key-value memory network paradigm, in which each key memory involves multiple constituent keys. These constituent keys work collaboratively for training, which enables the CMN to obtain an optimal video representation in a larger space. Second, we introduce a multi-saliency embedding algorithm which encodes a variable-length video sequence into a fixed-size matrix representation by discovering multiple saliencies of interest. For example, given a video of car auction, some people are interested in the car, while others are interested in the auction activities. Third, we design an abstract memory on top of the constituent keys. The abstract memory and constituent keys form a layered structure, which makes the CMN more efficient and capable of being scaled, while also retaining the representation capability of the multiple keys. We compare CMN with several state-of-the-art baselines on a new few-shot video classification dataset and show the effectiveness of our approach.
Chang, X, Yu, YL & Yang, Y 2017, 'Robust top-k multiclass SVM for visual category recognition', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Halifax, Nova Scotia, Canada, pp. 75-83.View/Download from: UTS OPUS or Publisher's site
© 2017 Association for Computing Machinery. Classification problems with a large number of classes inevitably involve overlapping or similar classes. In such cases it seems reasonable to allow the learning algorithm to make mistakes on similar classes, as long as the true class is still among the top-k (say) predictions. Likewise, in applications such as search engines or ad display, we are allowed to present k predictions at a time and the customer would be satisfied as long as the prediction she is interested in is included. Inspired by the recent work of , we propose a very generic, robust multiclass SVM formulation that directly aims at minimizing a weighted and truncated combination of the ordered prediction scores. Our method includes many previous works as special cases. Computationally, using the Jordan decomposition Lemma we show how to rewrite our objective as the difference of two convex functions, based on which we develop an efficient algorithm that allows incorporating many popular regularizers (such as the ℓ2 and ℓ1 norms). We conduct extensive experiments on four real large-scale visual category recognition datasets, and obtain very promising performance.
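The top-k criterion described in the abstract above can be sketched in a few lines. This shows only the acceptance test (is the true class among the k highest-scoring classes?), not the robust SVM objective of the paper. The function name `in_top_k` and the scores are assumptions made for illustration.

```python
def in_top_k(scores, true_class, k):
    """Return True if true_class is among the k highest-scoring classes.

    Illustrative helper; ties are broken by class index because sorted() is stable.
    """
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return true_class in ranked[:k]


scores = [0.1, 2.3, 1.9, -0.4]  # made-up prediction scores for four classes
top1_hit = in_top_k(scores, true_class=2, k=1)
top2_hit = in_top_k(scores, true_class=2, k=2)
```

Here class 2 is ranked second, so it counts as a mistake under top-1 but as a correct prediction under top-2.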
Dong, X, Huang, J, Yang, Y & Yan, S 2017, 'More is less: A more complicated network with less inference complexity', Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, HI, USA, pp. 1895-1903.View/Download from: UTS OPUS or Publisher's site
© 2017 IEEE. In this paper, we present a novel and general network structure towards accelerating the inference process of convolutional neural networks, which is more complicated in network structure yet with less inference complexity. The core idea is to equip each original convolutional layer with another low-cost collaborative layer (LCCL), and the element-wise multiplication of the ReLU outputs of these two parallel layers produces the layer-wise output. The combined layer is potentially more discriminative than the original convolutional layer, and its inference is faster for two reasons: 1) the zero cells of the LCCL feature maps will remain zero after element-wise multiplication, and thus it is safe to skip the calculation of the corresponding high-cost convolution in the original convolutional layer; 2) LCCL is very fast if it is implemented as a 1 × 1 convolution or only a single filter shared by all channels. Extensive experiments on the CIFAR-10, CIFAR-100 and ILSVRC-2012 benchmarks show that our proposed network structure can accelerate the inference process by 32% on average with negligible performance drop.
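The skipping argument in the abstract above can be made concrete with a toy one-dimensional sketch. The names `combined_layer` and `expensive_conv_at` are hypothetical, and a plain lookup stands in for the costly convolution; this is not the authors' implementation. Wherever the cheap branch is zero after ReLU, the product is zero regardless of the expensive branch, so the expensive value never needs to be computed.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def combined_layer(lccl_pre, expensive_conv_at):
    """Multiply the ReLU outputs of a cheap and an expensive branch, position by position.

    lccl_pre: cheap pre-activations, one per output position.
    expensive_conv_at: callable returning the costly pre-activation at a position.
    Returns the outputs and the number of expensive evaluations actually made.
    """
    outputs = []
    expensive_calls = 0
    for i, pre in enumerate(lccl_pre):
        gate = relu(pre)
        if gate == 0.0:
            # The product is zero whatever the expensive branch yields, so skip it.
            outputs.append(0.0)
            continue
        expensive_calls += 1
        outputs.append(gate * relu(expensive_conv_at(i)))
    return outputs, expensive_calls


cheap = [-1.0, 0.5, 0.0, 2.0]        # made-up cheap-branch pre-activations
costly = [3.0, 4.0, 5.0, -1.0]       # made-up expensive-branch pre-activations
outputs, calls = combined_layer(cheap, lambda i: costly[i])

# Reference result computed without skipping, for comparison.
dense = [relu(a) * relu(b) for a, b in zip(cheap, costly)]
```

In this toy run only two of the four positions require the expensive evaluation, and the result matches the dense computation.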
Fan, H, Chang, X, Cheng, D, Yang, Y, Xu, D & Hauptmann, AG 2017, 'Complex Event Detection by Identifying Reliable Shots from Untrimmed Videos', Proceedings of the IEEE International Conference on Computer Vision, International Conference on Computer Vision, IEEE, Venice, Italy, pp. 736-744.View/Download from: UTS OPUS or Publisher's site
© 2017 IEEE. The goal of complex event detection is to automatically detect whether an event of interest happens in temporally untrimmed long videos which usually consist of multiple video shots. Observing that some video shots in positive (resp. negative) videos are irrelevant (resp. relevant) to the given event class, we formulate this task as a multi-instance learning (MIL) problem by taking each video as a bag and the video shots in each video as instances. To this end, we propose a new MIL method, which simultaneously learns a linear SVM classifier and infers a binary indicator for each instance in order to select reliable training instances from each positive or negative bag. In our new objective function, we balance the weighted training errors and an ℓ1-ℓ2 mixed-norm regularization term which adaptively selects reliable shots as training instances from different videos to have them as diverse as possible. We also develop an alternating optimization approach that can efficiently solve our proposed objective function. Extensive experiments on the challenging real-world Multimedia Event Detection (MED) datasets MEDTest-14, MEDTest-13 and CCV clearly demonstrate the effectiveness of our proposed MIL approach for complex event detection.
Liu, W, Chang, X, Chen, L & Yang, Y 2017, 'Early Active Learning with Pairwise Constraint for Person Re-identification', ECML PKDD 2017: Machine Learning and Knowledge Discovery in Databases, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, Skopje, Macedonia, pp. 103-118.View/Download from: UTS OPUS or Publisher's site
Research on person re-identification (re-id) has attracted much attention in the machine learning field in recent years. With sufficient labeled training data, supervised re-id algorithms can obtain promising performance. However, producing labeled data for training supervised re-id models is an extremely challenging and time-consuming task because it requires every pair of images across non-overlapping camera views to be labeled. Moreover, in the early stage of experiments, when labor resources are limited, only a small amount of data can be labeled. Thus, it is essential to design an effective algorithm to select the most representative samples. This is referred to as the early active learning or early stage experimental design problem. The pairwise relationship plays a vital role in the re-id problem, but most of the existing early active learning algorithms fail to consider this relationship. To overcome this limitation, we propose a novel and efficient early active learning algorithm with a pairwise constraint for person re-identification in this paper. By introducing the pairwise constraint, the closeness of similar representations of instances is enforced in active learning. This benefits the performance of active learning for re-id. Extensive experimental results on four benchmark datasets confirm the superiority of the proposed algorithm.
Liu, Z, Wang, Z, Zhang, L, Shah, RR, Xia, Y, Yang, Y & Li, X 2017, 'FastShrinkage: Perceptually-aware retargeting toward mobile platforms', MM 2017 - Proceedings of the 2017 ACM Multimedia Conference, ACM on Multimedia Conference, Association for Computing Machinery, Mountain View, California, USA, pp. 501-509.View/Download from: UTS OPUS or Publisher's site
© 2017 ACM. Retargeting aims at adapting an original high-resolution photo/video to a low-resolution screen with an arbitrary aspect ratio. Conventional approaches are generally based on desktop PCs, since the computation might be intolerable for mobile platforms (especially when retargeting videos). Besides, typically only low-level visual features are exploited, whereas human visual perception is not well encoded. In this paper, we propose a novel retargeting framework which quickly shrinks a photo/video by leveraging human gaze behavior. Specifically, we first derive a geometry-preserved graph ranking algorithm, which efficiently selects a few salient object patches to mimic the human gaze shifting path (GSP) when viewing each scene. Afterward, an aggregation-based CNN is developed to hierarchically learn the deep representation for each GSP. Based on this, a probabilistic model is developed to learn the priors of the training photos which are marked as aesthetically pleasing by professional photographers. We utilize the learned priors to efficiently shrink the corresponding GSP of a retargeted photo/video to be maximally similar to those from the training photos. Extensive experiments have demonstrated that: 1) our method consumes less than 35ms to retarget a 1024 × 768 photo (or a 1280 × 720 video frame) on popular iOS/Android devices, which is orders of magnitude faster than the conventional retargeting algorithms; 2) the retargeted photos/videos produced by our method outperform their competitors significantly based on the paired-comparison-based user study; and 3) the learned GSPs are highly indicative of human visual attention according to the human eye tracking experiments.
Luo, M, Nie, F, Chang, X, Yang, Y, Hauptmann, A & Zheng, Q 2017, 'Probabilistic non-negative matrix factorization and its robust extensions for topic modeling', 31st AAAI Conference on Artificial Intelligence, AAAI 2017, Conference on Artificial Intelligence, San Francisco, California USA, pp. 2308-2314.View/Download from: UTS OPUS
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Traditional topic models with maximum likelihood estimation inevitably suffer from the conditional independence of words given the document's topic distribution. In this paper, we follow the generative procedure of the topic model and learn the topic-word distribution and topics distribution via directly approximating the word-document co-occurrence matrix with a matrix decomposition technique. These methods include: (1) Approximating the normalized document-word conditional distribution with the documents probability matrix and words probability matrix based on probabilistic non-negative matrix factorization (NMF); (2) Since the standard NMF is well known to be non-robust to noise and outliers, we extended the probabilistic NMF of the topic model to its robust versions using ℓ2,1-norm and capped ℓ2,1-norm based loss functions, respectively. The proposed framework inherits the explicit probabilistic meaning of factors in topic models and simultaneously makes the conditional independence assumption on words unnecessary. Straightforward and efficient algorithms are exploited to solve the corresponding non-smooth and non-convex problems. Experimental results over several benchmark datasets illustrate the effectiveness and superiority of the proposed methods.
Pan, P, Feng, J, Chen, L & Yang, Y 2017, 'Online compressed robust PCA', Proceedings of the International Joint Conference on Neural Networks, International Joint Conference on Neural Networks, IEEE, Anchorage, AK, USA, pp. 1041-1048.View/Download from: UTS OPUS or Publisher's site
© 2017 IEEE. In this work, we consider the problem of robust principal component analysis (RPCA) for streaming noisy data that has been highly compressed. This problem is prominent when one deals with high-dimensional and large-scale data and data compression is necessary. To solve this problem, we propose an online compressed RPCA algorithm to efficiently recover the low-rank components of raw data. Though data compression incurs severe information loss, we provide deep analysis on the proposed algorithm and prove that the low-rank component can be asymptotically recovered under mild conditions. Compared with other recent works on compressed RPCA, our algorithm reduces the memory cost significantly by processing data in an online fashion and reduces the communication cost by accepting sequential compressed data as input.
Xu, Z, Zhu, L & Yang, Y 2017, 'Few-shot object recognition from machine-labeled web images', Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, HI, USA, pp. 5358-5366.View/Download from: UTS OPUS or Publisher's site
© 2017 IEEE. With the tremendous advances made by Convolutional Neural Networks (ConvNets) on object recognition, we can now easily obtain adequately reliable machine-labeled annotations from predictions by off-the-shelf ConvNets. In this work, we present an "abstraction memory" based framework for few-shot learning, building upon machine-labeled image annotations. Our method takes a large-scale machine-annotated dataset (e.g., OpenImages) as an external memory bank. In the external memory bank, the information is stored in the memory slots in key-value form, in which the image feature is regarded as the key and the label embedding serves as the value. When queried by the few-shot examples, our model selects visually similar data from the external memory bank and writes the useful information obtained from related external data into another memory bank, i.e. abstraction memory. Long Short-Term Memory (LSTM) controllers and attention mechanisms are utilized to guarantee the data written to the abstraction memory correlates with the query example. The abstraction memory concentrates information from the external memory bank to make the few-shot recognition effective. In the experiments, we first confirm that our model can learn to conduct few-shot object recognition on clean human-labeled data from the ImageNet dataset. Then, we demonstrate that with our model, machine-labeled image annotations are very effective and abundant resources for performing object recognition on novel categories. Experimental results show that our proposed model with machine-labeled annotations achieves great results, with only a 1% difference in accuracy between the machine-labeled annotations and the human-labeled annotations.
Yan, Y, Yang, T, Yang, Y & Chen, J 2017, 'A Framework of Online Learning with Imbalanced Streaming Data', Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, San Francisco, USA, pp. 2817-2823.View/Download from: UTS OPUS
A challenge for mining large-scale streaming data overlooked by most existing studies on online learning is the skew distribution of examples over different classes. Many previous works have considered cost-sensitive approaches in an online setting for streaming data, where fixed costs are assigned to different classes, or ad-hoc costs are adapted based on the distribution of data received so far. However, it is not necessary for them to achieve optimal performance in terms of the measures suited for imbalanced data, such as F-measure, area under ROC curve (AUROC), area under precision and recall curve (AUPRC). This work proposes a general framework for online learning with imbalanced streaming data, where examples are coming sequentially and models are updated accordingly on-the-fly. By simultaneously learning multiple classifiers with different cost vectors, the proposed method can be adopted for different target measures for imbalanced data, including F-measure, AUROC and AUPRC. Moreover, we present a rigorous theoretical justification of the proposed framework for the F-measure maximization. Our empirical studies demonstrate the competitive if not better performance of the proposed method compared to previous cost-sensitive and resampling based online learning algorithms and those that are designed for optimizing certain
Yang, Y 2017, 'A Dual-Network Progressive Approach to Weakly Supervised Object Detection', MM '17 Proceedings of the 2017 ACM on Multimedia Conference, ACM on Multimedia Conference, ACM, Mountain View, pp. 279-287.View/Download from: UTS OPUS or Publisher's site
A major challenge that arises in Weakly Supervised Object Detection (WSOD) is that only image-level labels are available, whereas WSOD trains instance-level object detectors. A typical approach to WSOD is to 1) generate a series of region proposals for each image and assign the image-level label to all the proposals in that image; 2) train a classifier using all the proposals; and 3) use the classifier to select proposals with high confidence scores as the positive instances for another round of training. In this way, the image-level labels are iteratively transferred to instance-level labels.
We aim to resolve the following two fundamental problems within this paradigm. First, existing proposal generation algorithms are not yet robust, thus the object proposals are often inaccurate. Second, the selected positive instances are sometimes noisy and unreliable, which hinders the training at subsequent iterations. We adopt two separate neural networks, one to focus on each problem, to better utilize the specific characteristic of region proposal refinement and positive instance selection. Further, to leverage the mutual benefits of the two tasks, the two neural networks are jointly trained and reinforced iteratively in a progressive manner, starting with easy and reliable instances and then gradually incorporating difficult ones at a later stage when the selection classifier is more robust. Extensive experiments on the PASCAL VOC dataset show that our method achieves state-of-the-art performance.
Zheng, L, Zhang, H, Sun, S, Chandraker, M, Yang, Y & Tian, Q 2017, 'Person Re-identification in the Wild', Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, HI, USA.View/Download from: UTS OPUS or Publisher's site
This paper presents a novel large-scale dataset and comprehensive baselines for end-to-end pedestrian detection and person recognition in raw video frames. Our baselines address three issues: the performance of various combinations of detectors and recognizers, mechanisms for pedestrian detection to help improve overall re-identification (re-ID) accuracy, and assessing the effectiveness of different detectors for re-ID. We make three distinct contributions. First, a new dataset, PRW, is introduced to evaluate Person Re-identification in the Wild, using videos acquired through six synchronized cameras. It contains 932 identities and 11,816 frames in which pedestrians are annotated with their bounding box positions and identities. Extensive benchmarking results are presented on this dataset. Second, we show that pedestrian detection aids re-ID through two simple yet effective improvements: a cascaded fine-tuning strategy that trains a detection model first and then the classification model, and a Confidence Weighted Similarity (CWS) metric that incorporates detection scores into similarity measurement. Third, we derive insights in evaluating detector performance for the particular scenario of accurate person re-ID.
Zheng, Z, Zheng, L & Yang, Y 2017, 'Unlabeled Samples Generated by GAN Improve the Person Re-identification Baseline in vitro', Proceedings 2017 IEEE International Conference on Computer Vision ICCV 2017, 2017 IEEE International Conference on Computer Vision, IEEE, Venice, Italy.View/Download from: UTS OPUS or Publisher's site
The main contribution of this paper is a simple semi-supervised pipeline that only uses the original training set without collecting extra data. It is challenging in 1) how to obtain more training data only from the training set and 2) how to use the newly generated data. In this work, the generative adversarial network (GAN) is used to generate unlabeled samples. We propose the label smoothing regularization for outliers (LSRO). This method assigns a uniform label distribution to the unlabeled images, which regularizes the supervised model and improves the baseline. We verify the proposed method on a practical problem: person re-identification (re-ID). This task aims to retrieve a query person from other cameras. We adopt the deep convolutional generative adversarial network (DCGAN) for sample generation, and a baseline convolutional neural network (CNN) for representation learning. Experiments show that adding the GAN-generated data effectively improves the discriminative ability of learned CNN embeddings. On three large-scale datasets, Market-1501, CUHK03 and DukeMTMC-reID, we obtain +4.37%, +1.6% and +2.46% improvement in rank-1 precision over the baseline CNN, respectively. We additionally apply the proposed method to fine-grained bird recognition and achieve a +0.6% improvement over a strong baseline.
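The idea of assigning a uniform label distribution to generated images can be sketched as a per-sample loss. The sketch assumes a standard softmax cross-entropy classifier; the function name `lsro_loss` and the convention of passing `label=None` for a generated image are choices made for this illustration, not details confirmed by the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lsro_loss(logits, label=None):
    """Cross-entropy against a one-hot target (real image, label given)
    or against a uniform target over all classes (generated image, label None)."""
    log_probs = [math.log(p) for p in softmax(logits)]
    if label is not None:
        return -log_probs[label]
    return -sum(log_probs) / len(log_probs)


real_loss = lsro_loss([2.0, 0.5, -1.0, 0.0], label=0)     # made-up logits
flat_generated_loss = lsro_loss([0.0, 0.0, 0.0, 0.0])      # uniform prediction
peaked_generated_loss = lsro_loss([3.0, 0.0, 0.0, 0.0])    # confident prediction
```

Under the uniform target, a confident prediction on a generated image is penalised more than a flat one, which is the regularising effect the abstract describes.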
Zhu, L, Xu, Z & Yang, Y 2017, 'Bidirectional multirate reconstruction for temporal modeling in videos', Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, HI, USA, pp. 1339-1348.View/Download from: UTS OPUS or Publisher's site
© 2017 IEEE. Despite the recent success of neural networks in image feature learning, a major problem in the video domain is the lack of sufficient labeled data for learning to model temporal information. In this paper, we propose an unsupervised temporal modeling method that learns from untrimmed videos. The speed of motion varies constantly, e.g., a man may run quickly or slowly. We therefore train a Multirate Visual Recurrent Model (MVRM) by encoding frames of a clip with different intervals. This learning process makes the learned model more capable of dealing with motion speed variance. Given a clip sampled from a video, we use its past and future neighboring clips as the temporal context, and reconstruct the two temporal transitions, i.e., present → past transition and present → future transition, reflecting the temporal information in different views. The proposed method exploits the two transitions simultaneously by incorporating a bidirectional reconstruction which consists of a backward reconstruction and a forward reconstruction. We apply the proposed method to two challenging video tasks, i.e., complex event detection and video captioning, in which it achieves state-of-the-art performance. Notably, our method generates the best single feature for event detection with a relative improvement of 10.4% on the MEDTest-13 dataset and achieves the best performance in video captioning across all evaluation metrics on the YouTube2Text dataset.
Chang, X, Yang, Y, Long, G, Zhang, C & Hauptmann, AG 2016, 'Dynamic concept composition for zero-example event detection', Proceedings of 30th AAAI Conference on Artificial Intelligence, AAAI 2016, AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence, Phoenix, Arizona, United States, pp. 3464-3470.View/Download from: UTS OPUS
© Copyright 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. In this paper, we focus on automatically detecting events in unconstrained videos without the use of any visual training exemplars. In principle, zero-shot learning makes it possible to train an event detection model based on the assumption that events (e.g. birthday party) can be described by multiple mid-level semantic concepts (e.g. "blowing candle", "birthday cake"). Towards this goal, we first pre-train a bundle of concept classifiers using data from other sources. Then we evaluate the semantic correlation of each concept w.r.t. the event of interest and pick out the relevant concept classifiers, which are applied on all test videos to get multiple prediction score vectors. While most existing systems combine the predictions of the concept classifiers with fixed weights, we propose to learn the optimal weights of the concept classifiers for each testing video by exploring a set of online available videos with freeform text descriptions of their content. To validate the effectiveness of the proposed approach, we have conducted extensive experiments on the latest TRECVID MEDTest 2014, MEDTest 2013 and CCV datasets. The experimental results confirm the superiority of the proposed approach.
Yan, Y, Xu, Z, Tsang, W, Long, G & Yang, Y 2016, 'Robust Semi-supervised Learning through Label Aggregation', Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), AAAI Conference on Artificial Intelligence, AAAI, Phoenix, USA, pp. 2244-2250.View/Download from: UTS OPUS
Semi-supervised learning is proposed to exploit both labeled and unlabeled data. However, as the scale of data in real world applications increases significantly, conventional semi-supervised algorithms usually lead to massive computational cost and cannot be applied to large scale datasets. In addition, label noise is usually present in the practical applications due to human annotation, which very likely results in remarkable degeneration of performance in semi-supervised methods. To address these two challenges, in this paper, we propose an efficient RObust Semi-Supervised Ensemble Learning (ROSSEL) method, which generates pseudo-labels for unlabeled data using a set of weak annotators, and combines them to approximate the ground-truth labels to assist semi-supervised learning. We formulate the weighted combination process as a multiple label kernel learning (MLKL) problem which can be solved efficiently. Compared with other semi-supervised learning algorithms, the proposed method has linear time complexity. Extensive experiments on five benchmark datasets demonstrate the superior effectiveness, efficiency and robustness of the proposed algorithm.
Chang, X, Yu, YL, Yang, Y & Xing, EP 2016, 'They are not equally reliable: Semantic event search using differentiated concept classifiers', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, Nevada, United States, pp. 1884-1893.View/Download from: UTS OPUS or Publisher's site
Complex event detection on unconstrained Internet videos has seen much progress in recent years. However, state-of-the-art performance degrades dramatically when the number of positive training exemplars falls short. Since label acquisition is costly, laborious, and time-consuming, there is a real need to consider the much more challenging semantic event search problem, where no example video is given. In this paper, we present a state-of-the-art event search system without any example videos. Relying on the key observation that events (e.g. dog show) are usually compositions of multiple mid-level concepts (e.g. "dog," "theater," and "dog jumping"), we first train a skip-gram model to measure the relevance of each concept with the event of interest. The relevant concept classifiers then cast votes on the test videos but their reliability, due to lack of labeled training videos, has been largely unaddressed. We propose to combine the concept classifiers based on a principled estimate of their accuracy on the unlabeled test videos. A novel warping technique is proposed to improve the performance and an efficient highly-scalable algorithm is provided to quickly solve the resulting optimization. We conduct extensive experiments on the latest TRECVID MEDTest 2014, MEDTest 2013 and CCV datasets, and achieve state-of-the-art performances.
Du, X, Yin, H, Huang, Z, Yang, Y & Zhou, X 2016, 'Using detected visual objects to index video database', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Australasian Database Conference, Springer, Sydney, New South Wales, Australia, pp. 333-345.View/Download from: UTS OPUS or Publisher's site
© Springer International Publishing AG 2016. In this paper, we focus on how to use visual objects to index videos. Two tables are constructed for this purpose, namely the unique object table and the occurrence table. The former table stores the unique objects which appear in the videos, while the latter table stores the occurrence information of these unique objects in the videos. In previous works, these two tables are generated manually by a top-down process. That is, the unique object table is given by the experts at first, then the occurrence table is generated by the annotators according to the unique object table. Obviously, such a process, which heavily depends on human labor, limits the scalability, especially when the data are dynamic or large-scale. To improve this, we propose to perform a bottom-up process to generate these two tables. The novelties are: we use an object detector instead of human annotation to create the occurrence table; we propose a hybrid method which consists of local merge, global merge and propagation to generate the unique object table and fix the occurrence table. In fact, there are another three candidate methods for implementing the bottom-up process, namely, recognizing-based, matching-based and tracking-based methods. Through analyzing their mechanism and evaluating their accuracy, we find that they are not suitable for the bottom-up process. The proposed hybrid method leverages the advantages of the matching-based and tracking-based methods. Our experiments show that the hybrid method is more accurate and efficient than the candidate methods, which indicates that it is more suitable for the proposed bottom-up process.
Luo, M, Nie, F, Chang, X, Yang, Y, Hauptmann, A & Zheng, Q 2016, 'Avoiding optimal mean robust PCA/2DPCA with non-greedy ℓ1-norm maximization', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press / International Joint Conferences on Artificial Intelligence, New York City, New York, United States, pp. 1802-1808.View/Download from: UTS OPUS
Robust principal component analysis (PCA) is one of the most important dimension reduction techniques to handle high-dimensional data with outliers. However, the existing robust PCA presupposes that the mean of the data is zero and incorrectly utilizes the Euclidean distance based optimal mean for robust PCA with ℓ1-norm. Some studies consider this issue and integrate the estimation of the optimal mean into the dimension reduction objective, which leads to expensive computation. In this paper, we equivalently reformulate the maximization of variances for robust PCA, such that the optimal projection directions are learned by maximizing the sum of the projected difference between each pair of instances, rather than the difference between each instance and the mean of the data. Based on this reformulation, we propose a novel robust PCA to automatically avoid the calculation of the optimal mean based on ℓ1-norm distance. This strategy also makes the assumption of centered data unnecessary. Additionally, we intuitively extend the proposed robust PCA to its 2D version for image recognition. Efficient non-greedy algorithms are exploited to solve the proposed robust PCA and 2D robust PCA with fast convergence and low computational complexity. Some experimental results on benchmark data sets demonstrate the effectiveness and superiority of the proposed approaches on image reconstruction and recognition.
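The abstract above replaces differences to the mean with differences between pairs of instances. The sketch below checks only the classical squared-distance identity that motivates such a reformulation: the sum of squared deviations from the mean equals the sum of squared pairwise differences divided by 2n. It does not implement the paper's ℓ1-norm objective or its non-greedy algorithm, and the function names and data are made up for illustration.

```python
def project(points, w):
    """Project each point onto direction w (plain dot product)."""
    return [sum(a * b for a, b in zip(x, w)) for x in points]


def scatter_via_mean(points, w):
    """Sum of squared deviations of the projections from their mean."""
    proj = project(points, w)
    mean = sum(proj) / len(proj)
    return sum((p - mean) ** 2 for p in proj)


def scatter_via_pairs(points, w):
    """Same quantity computed from all ordered pairs, with no mean involved."""
    proj = project(points, w)
    n = len(proj)
    return sum((p - q) ** 2 for p in proj for q in proj) / (2 * n)


points = [(0.0, 0.0), (2.0, 0.0), (4.0, 6.0)]  # made-up, deliberately uncentered
w = (1.0, 0.0)
via_mean = scatter_via_mean(points, w)
via_pairs = scatter_via_pairs(points, w)
```

Because the pairwise form never refers to the mean, shifting every point by the same offset leaves it unchanged, which is consistent with the abstract's remark that the centered-data assumption becomes unnecessary.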
Pan, P, Xu, Z, Yang, Y, Wu, F & Zhuang, Y 2016, 'Hierarchical recurrent neural encoder for video representation with application to captioning', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, WA, USA, pp. 1029-1038.View/Download from: UTS OPUS or Publisher's site
Recently, deep learning approaches, especially deep Convolutional Neural Networks (ConvNets), have achieved overwhelming accuracy with fast processing speed for image classification. Incorporating temporal structure with deep ConvNets for video representation has become a fundamental problem for video content analysis. In this paper, we propose a new approach, namely Hierarchical Recurrent Neural Encoder (HRNE), to exploit temporal information of videos. Compared to recent video representation inference approaches, this paper makes the following three contributions. First, our HRNE is able to efficiently exploit video temporal structure in a longer range by reducing the length of input information flow, and compositing multiple consecutive inputs at a higher level. Second, computation operations are significantly lessened while attaining more nonlinearity. Third, HRNE is able to uncover temporal transitions between frame chunks with different granularities, i.e. it can model the temporal transitions between frames as well as the transitions between segments. We apply the new method to video captioning where temporal information plays a crucial role. Experiments demonstrate that our method outperforms the state-of-the-art on video captioning benchmarks.
Yang, Y, Gan, C, Lin, M, de Melo, G & Hauptmann, AG 2016, 'Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition', Website Proceedings of 2016 AAAI Conference on Artificial Intelligence, AAAI, AAAI Conference on Artificial Intelligence, AAAI, Phoenix, Arizona, pp. 3487-3493.View/Download from: UTS OPUS
Vast quantities of videos are now being captured at astonishing rates, but the majority of these are not labelled. To cope with such data, we consider the task of content-based activity recognition in videos without any manually labelled examples, also known as zero-shot video recognition. To achieve this, videos are represented in terms of detected visual concepts, which are then scored as relevant or irrelevant according to their similarity with a given textual query. In this paper, we propose a more robust approach for scoring concepts in order to alleviate many of the brittleness and low precision problems of previous work. We jointly consider semantic relatedness, visual reliability, and discriminative power. In addition, to handle noise and non-linearities in the ranking scores of the selected concepts, we propose a novel pairwise order matrix approach for score aggregation. Extensive experiments on the large-scale TRECVID Multimedia Event Detection data show the superiority of our approach.
Yang, Y, Gan, C, Yao, T, Yang, K & Mei, T 2016, 'You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images', Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, pp. 923-932.View/Download from: UTS OPUS or Publisher's site
Yan, Y, Tan, M, Yang, Y, Tsang, I, Zhang, C & Shi, Q 2015, 'Scalable maximum margin matrix factorization by active riemannian subspace search', Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), International Joint Conference on Artificial Intelligence, AAAI, Buenos Aires, Argentina, pp. 3988-3994.View/Download from: UTS OPUS
The user ratings in recommendation systems are usually in the form of ordinal discrete values. To give more accurate prediction of such rating data, maximum margin matrix factorization (M3F) was proposed. Existing M3F algorithms, however, either have massive computational cost or require expensive model selection procedures to determine the number of latent factors (i.e. the rank of the matrix to be recovered), making them less practical for large scale data sets. To address these two challenges, in this paper, we formulate M3F with a known number of latent factors as the Riemannian optimization problem on a fixed-rank matrix manifold and present a block-wise nonlinear Riemannian conjugate gradient method to solve it efficiently. We then apply a simple and efficient active subspace search scheme to automatically detect the number of latent factors. Empirical studies on both synthetic data sets and large real-world data sets demonstrate the superior efficiency and effectiveness of the proposed method.
Chang, X, Yang, Y, Hauptmann, A, Xing, EP & Yu, YL 2015, 'Semantic Concept Discovery for Large-Scale Zero-Shot Event Detection', Proceedings of the 24th International Conference on Artificial Intelligence, International Joint Conference of Artificial Intelligence, ACM, Buenos Aires, Argentina, pp. 2234-2240.View/Download from: UTS OPUS
We focus on detecting complex events in unconstrained Internet videos. While most existing works rely on the abundance of labeled training data, we consider a more difficult zero-shot setting where no training data is supplied. We first pre-train a number of concept classifiers using data from other sources. Then we evaluate the semantic correlation of each concept w.r.t. the event of interest. After further refinement to take prediction inaccuracy and discriminative power into account, we apply the discovered concept classifiers on all test videos and obtain multiple score vectors. These distinct score vectors are converted into pairwise comparison matrices and the nuclear norm rank aggregation framework is adopted to seek consensus. To address the challenging optimization formulation, we propose an efficient, highly scalable algorithm that is an order of magnitude faster than existing alternatives. Experiments on recent TRECVID datasets verify the superiority of the proposed approach.
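The step of converting score vectors into pairwise comparison matrices can be sketched as follows. This is a minimal illustration under assumptions: the difference-based construction is one common choice and may not match the paper's exact definition, the function names and scores are hypothetical, and plain averaging is swapped in for the nuclear norm rank aggregation, which is not reproduced here.

```python
def pairwise_comparison_matrix(scores):
    """M[i][j] = scores[i] - scores[j]; antisymmetric by construction."""
    return [[si - sj for sj in scores] for si in scores]

def average_matrices(mats):
    """Element-wise mean of several comparison matrices.

    Stand-in for the consensus step; NOT the nuclear norm formulation.
    """
    n = len(mats[0])
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(n)]
            for i in range(n)]

def consensus_scores(matrix):
    """Row means give one consensus score per video (up to a constant shift)."""
    return [sum(row) / len(row) for row in matrix]

# Hypothetical scores for three test videos from two concept classifiers.
scores_a = [0.9, 0.2, 0.5]
scores_b = [0.8, 0.1, 0.7]

mats = [pairwise_comparison_matrix(s) for s in (scores_a, scores_b)]
consensus = consensus_scores(average_matrices(mats))
ranking = sorted(range(len(consensus)), key=lambda i: consensus[i], reverse=True)
```

Working with pairwise comparisons rather than raw scores makes the aggregation depend only on relative ordering information, which is insensitive to a constant offset in any single classifier's scores.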
Chang, X, Yang, Y, Xing, EP & Yu, Y-L 2015, 'Complex Event Detection using Semantic Saliency and Nearly-Isotonic SVM', Proceedings of The 32nd International Conference on Machine Learning, International Conference on Machine Learning, International Machine Learning Society, Lille Grand, Paris, pp. 1348-1357.View/Download from: UTS OPUS
We aim to detect complex events in long Internet videos that may last for hours. A major challenge in this setting is that only a few shots in a long video are relevant to the event of interest while others are irrelevant or even misleading. Instead of indifferently pooling the shots, we first define a novel notion of semantic saliency that assesses the relevance of each shot with the event of interest. We then prioritize the shots according to their saliency scores since shots that are semantically more salient are expected to contribute more to the final event detector. Next, we propose a new isotonic regularizer that is able to exploit the semantic ordering information. The resulting nearly-isotonic SVM classifier exhibits higher discriminative power. Computationally, we develop an efficient implementation using the proximal gradient algorithm, and we prove new, closed-form proximal steps. We conduct extensive experiments on three real-world video datasets and confirm the effectiveness of the proposed approach.
Chang, X, Yu, YL, Yang, Y & Hauptmann, AG 2015, 'Searching persuasively: Joint event detection and evidence recounting with limited supervision', Proceedings of the 23rd ACM international conference on Multimedia, ACM International Conference on Multimedia, ACM, Brisbane, Australia, pp. 581-590.View/Download from: UTS OPUS or Publisher's site
© 2015 ACM. Multimedia event detection (MED) and multimedia event recounting (MER) are fundamental tasks in managing large amounts of unconstrained web videos, and have attracted a lot of attention in recent years. Most existing systems perform MER as a postprocessing step on top of the MED results. In order to leverage the mutual benefits of the two tasks, we propose a joint framework that simultaneously detects high-level events and localizes the indicative concepts of the events. Our premise is that a good recounting algorithm should not only explain the detection result, but should also be able to assist detection in the first place. Coupled in a joint optimization framework, recounting improves detection by pruning irrelevant noisy concepts while detection directs recounting to the most discriminative evidences. To better utilize the powerful and interpretable semantic video representation, we segment each video into several shots and exploit the rich temporal structures at shot level. The consequent computational challenge is carefully addressed through a significant improvement of the current ADMM algorithm, which, after eliminating all inner loops and equipping novel closed-form solutions for all intermediate steps, enables us to efficiently process extremely large video corpora. We test the proposed method on the large scale TRECVID MEDTest 2014 and MEDTest 2013 datasets, and obtain very promising results for both MED and MER.
Chang, XJ, Nie, FP, Ma, ZG, Yang, Y & Zhou, XF 2015, 'A Convex Formulation for Spectral Shrunk Clustering', Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, Austin Texas, USA, pp. 2532-2538.View/Download from: UTS OPUS
Spectral clustering is a fundamental technique in the field of data mining and information processing. Most existing spectral clustering algorithms integrate dimensionality reduction into the clustering process assisted by manifold learning in the original space. However, the manifold in reduced-dimensional subspace is likely to exhibit altered properties in contrast with the original space. Thus, applying manifold information obtained from the original space to the clustering process in a low-dimensional subspace is prone to inferior performance. Aiming to address this issue, we propose a novel convex algorithm that mines the manifold structure in the low-dimensional subspace. In addition, our unified learning process makes the manifold learning particularly tailored for the clustering. Compared with other related methods, the proposed algorithm yields a more structured clustering result. To validate the efficacy of the proposed algorithm, we perform extensive experiments on several benchmark datasets in comparison with some state-of-the-art clustering approaches. The experimental results demonstrate that the proposed algorithm has quite promising clustering performance.
Gan, C, Lin, M, Yang, Y, Zhuang, YT & Hauptmann, AG 2015, 'Exploring Semantic Inter-Class Relationships (SIR) for Zero-Shot Action Recognition', Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, Austin, USA, pp. 3769-3775.View/Download from: UTS OPUS
Automatically recognizing a large number of action categories from videos is of significant importance for video understanding. Most existing works focus on the design of more discriminative feature representations, and have achieved promising results when there are enough positive samples. However, very limited effort has been spent on recognizing a novel action without any positive exemplars, which is often the case in real settings due to the large number of action classes and the dramatic variations in users' queries. To address this issue, we propose to perform action recognition when no positive exemplars of that class are provided, which is often known as zero-shot learning. Different from other zero-shot learning approaches, which exploit attributes as the intermediate layer for the knowledge transfer, our main contribution is SIR, which directly leverages the semantic inter-class relationships between the known and unknown actions followed by label transfer learning. The inter-class semantic relationships are automatically measured by continuous word vectors, which are learned by the skip-gram model using a large-scale text corpus. Extensive experiments on the UCF101 dataset validate the superiority of our method over fully-supervised approaches using few positive exemplars.
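Measuring inter-class relationships with word vectors can be sketched as below. Everything here is an assumption for illustration: the three-dimensional vectors and class names are made up (real skip-gram vectors have hundreds of dimensions), cosine similarity is one common relatedness measure and is not confirmed as the paper's choice, and the label transfer step itself is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "word vectors" for action class names.
word_vec = {
    "jogging": [0.9, 0.1, 0.0],
    "running": [1.0, 0.2, 0.1],
    "swimming": [0.1, 1.0, 0.2],
}

def relatedness(unseen, known_classes):
    """Similarity of an unseen class name to each known class name."""
    return {k: cosine(word_vec[unseen], word_vec[k]) for k in known_classes}

# "jogging" has no training videos; relate it to classes that do.
sims = relatedness("jogging", ["running", "swimming"])
```

Known classes with high relatedness to the unseen class would then contribute more strongly when labels are transferred.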
Gan, C, Wang, N, Yang, Y, Yeung, DY & Hauptmann, AG 2015, 'Devnet: A Deep Event Network for Multimedia Event Detection and Evidence Recounting', Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 2568-2577.View/Download from: UTS OPUS or Publisher's site
In this paper, we focus on complex event detection in internet videos while also providing the key evidence for the detection results. Convolutional Neural Networks (CNNs) have achieved promising performance in image classification and action recognition tasks. However, it remains an open problem how to use CNNs for video event detection and recounting, mainly due to the complexity and diversity of video events. In this work, we propose a flexible deep CNN infrastructure, namely Deep Event Network (DevNet), that simultaneously detects pre-defined events and provides key spatial-temporal evidence. Taking key frames of videos as input, we first detect the event of interest at the video level by aggregating the CNN features of the key frames. The pieces of evidence that recount the detection results are also automatically localized, both temporally and spatially. The challenge is that we only have video level labels, while the key evidence usually takes place at the frame level. Based on the intrinsic property of CNNs, we first generate a spatial-temporal saliency map by back passing through DevNet, which can then be used to find the key frames that are most indicative of the event, as well as to localize the specific spatial position, usually an object, of the highly indicative area in the frame. Experiments on the large scale TRECVID 2014 MEDTest dataset demonstrate the promising performance of our method, both for event detection and evidence recounting.
Jiang, L, Yu, SI, Meng, D, Yang, Y, Mitamura, T & Hauptmann, AG 2015, 'Fast and accurate content-based semantic search in 100M Internet Videos', MM 2015 - Proceedings of the 2015 ACM Multimedia Conference, ACM International Conference on Multimedia, ACM, Brisbane, Australia, pp. 49-58.View/Download from: UTS OPUS or Publisher's site
© 2015 ACM. Large-scale content-based semantic search in video is an interesting and fundamental problem in multimedia analysis and retrieval. Existing methods index a video by the raw concept detection score that is dense and inconsistent, and thus cannot scale to "big data" that are readily available on the Internet. This paper proposes a scalable solution. The key is a novel step called concept adjustment that represents a video by a few salient and consistent concepts that can be efficiently indexed by the modified inverted index. The proposed adjustment model relies on a concise optimization framework with interpretations. The proposed index leverages the text-based inverted index for video retrieval. Experimental results validate the efficacy and the efficiency of the proposed method. The results show that our method can scale up the semantic search while maintaining state-of-the-art search performance. Specifically, the proposed method (with reranking) achieves the best result on the challenging TRECVID Multimedia Event Detection (MED) zero-example task. It only takes 0.2 seconds on a single CPU core to search a collection of 100 million Internet videos.
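The idea of indexing a video by a few salient concepts in an inverted index can be sketched as follows. This is a toy under assumptions: simple top-k selection is swapped in for the paper's optimization-based concept adjustment, a plain dictionary stands in for the modified inverted index, and all names and scores are hypothetical.

```python
from collections import defaultdict

def adjust(scores, k):
    """Keep the k highest-scoring concepts (stand-in for concept adjustment)."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)

def build_index(videos, k=2):
    """Map each retained concept to the (video, score) pairs that contain it."""
    index = defaultdict(list)
    for vid, scores in videos.items():
        for concept, score in adjust(scores, k).items():
            index[concept].append((vid, score))
    return index

def search(index, query_concepts):
    """Rank videos by summed scores, touching only the queried posting lists."""
    totals = defaultdict(float)
    for concept in query_concepts:
        for vid, score in index.get(concept, []):
            totals[vid] += score
    return sorted(totals, key=totals.get, reverse=True)

# Hypothetical raw concept detection scores for three videos.
videos = {
    "v1": {"dog": 0.9, "park": 0.7, "car": 0.1},
    "v2": {"car": 0.8, "road": 0.6, "dog": 0.2},
    "v3": {"dog": 0.5, "road": 0.4, "park": 0.3},
}
index = build_index(videos)
results = search(index, ["dog", "park"])
```

Because each video keeps only a few concepts, a query reads short posting lists instead of scanning a dense score matrix, which is what lets this kind of index scale.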
Liu, G, Yan, Y, Ricci, E, Yang, Y, Han, Y, Winkler, S & Sebe, N 2015, 'Inferring Painting Style with Multi-task Dictionary Learning', IJCAI'15 Proceedings of the 24th International Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press, Buenos Aires, pp. 2162-2168.View/Download from: UTS OPUS
Nie, LQ, Zhang, LM, Yang, Y, Wang, M, Hong, R & Chua, TS 2015, 'Beyond Doctors: Future Health Prediction from Multimedia and Multimodal Observations', Proceedings of the 2015 ACM Multimedia Conference, ACM International Conference on Multimedia, ACM, Brisbane, Australia, pp. 591-600.View/Download from: UTS OPUS or Publisher's site
Although chronic diseases cannot be cured, they can be effectively controlled as long as we understand their progressions based on the current observational health records, which are often in the form of multimedia data. A large and growing body of literature has investigated the disease progression problem. However, far too little attention to date has been paid to jointly considering the following three observations of chronic disease progression: 1) the health statuses at different time points are chronologically similar; 2) the future health statuses of each patient can be comprehensively revealed from the current multimedia and multimodal observations, such as visual scans, digital measurements and textual medical histories; and 3) the discriminative capabilities of different modalities vary significantly according to specific diseases. In light of these, we propose an adaptive multimodal multi-task learning model to co-regularize the modality agreement, temporal progression and discriminative capabilities of different modalities. We theoretically show that our proposed model is a linear system. Before training our model, we address the data missing problem via the matrix factorization approach. Extensive evaluations on a real-world Alzheimer's disease dataset well verify our proposed model. It should be noted that our model is also applicable to other chronic diseases.
Wu, Song, Yang, Y, Li, Zhang & Zhuang 2015, 'Structured Embedding via Pairwise Relations and Long-Range Interactions in Knowledge Base', Proceedings of the National Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, Austin, Texas, USA.View/Download from: UTS OPUS
We consider the problem of embedding entities and relations of knowledge bases into low-dimensional continuous vector spaces (distributed representations). Unlike most existing approaches, which are primarily efficient for modelling pairwise relations between entities, we attempt to explicitly model both pairwise relations and long-range interactions between entities, by interpreting them as linear operators on the low-dimensional embeddings of the entities. Therefore, in this paper we introduce Path-Ranking to capture the long-range interactions of a knowledge graph and at the same time preserve its pairwise relations; we call it 'structured embedding via pairwise relation and long-range interactions' (referred to as SePLi). Compared with state-of-the-art models, SePLi achieves better embedding performance.
Xu, ZW, Yang, Y & Hauptmann, AG 2015, 'A Discriminative CNN Video Representation for Event Detection', 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA.View/Download from: UTS OPUS or Publisher's site
In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkits. This paper makes two contributions to the inference of CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating frame level static features, we show that performance can be significantly improved by taking advantage of an appropriate encoding method. Second, we propose using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable. The integration of the two contributions results in a new state-of-the-art performance in event detection over the largest video datasets. Compared to improved Dense Trajectories, which has been recognized as the best video representation for event detection, our new representation improves the Mean Average Precision (mAP) from 27.6% to 36.8% for the TRECVID MEDTest 14 dataset and from 34.0% to 44.6% for the TRECVID MEDTest 13 dataset.
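The two baseline aggregation schemes named in this abstract, average pooling and max pooling of frame-level descriptors, can be sketched as below. This is a minimal illustration with hypothetical toy descriptors; the encoding method and latent concept descriptors proposed in the paper are not reproduced here.

```python
def average_pool(frames):
    """Element-wise mean over frame descriptors of equal length."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def max_pool(frames):
    """Element-wise maximum over frame descriptors of equal length."""
    return [max(column) for column in zip(*frames)]

# Hypothetical 3-d descriptors for two frames of one video.
frames = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]

video_avg = average_pool(frames)
video_max = max_pool(frames)
```

Both schemes collapse a variable number of frames into one fixed-length video vector, but each keeps only a single statistic per dimension, which is the limitation an encoding method aims to address.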
Yan, Y, Yang, Y, Shen, H, Meng, D, Liu, GW, Hauptmann, AG & Sebe, N 2015, 'Complex Event Detection via Event Oriented Dictionary Learning', Proceedings of the 29th AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, Austin, USA, pp. 3841-3847.View/Download from: UTS OPUS
Complex event detection is a retrieval task with the goal of finding videos of a particular event in a large-scale unconstrained internet video archive, given example videos and text descriptions. Nowadays, different multimodal fusion schemes of low-level and high-level features are extensively investigated and evaluated for the complex event detection task. However, how to effectively select semantically meaningful high-level concepts from a large pool to assist complex event detection is rarely studied in the literature. In this paper, we propose two novel strategies to automatically select semantically meaningful concepts for the event detection task based on both the events-kit text descriptions and the concepts' high-level feature descriptions. Moreover, we introduce a novel event oriented dictionary representation based on the selected semantic concepts. Towards this goal, we incorporate training samples of selected concepts from the Semantic Indexing (SIN) dataset, with a pool of 346 concepts, into a novel supervised multi-task dictionary learning framework. Extensive experimental results on the TRECVID Multimedia Event Detection (MED) dataset demonstrate the efficacy of our proposed method.
Yang, Y, Yu, SI, Jiang, L, Xu, ZW & Hauptmann, AG 2015, 'Content-Based Video Search over 1 Million Videos with 1 Core in 1 Second', Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, International Conference on Multimedia Retrieval, ACM, Shanghai, China, pp. 419-426.View/Download from: UTS OPUS or Publisher's site
Many content-based video search (CBVS) systems have been proposed to analyze the rapidly-increasing amount of user-generated videos on the Internet. Though the accuracy of CBVS systems has drastically improved, these high accuracy systems tend to be too inefficient for interactive search. Therefore, to strive for real-time web-scale CBVS, we perform a comprehensive study on the different components in a CBVS system to understand the trade-offs between accuracy and speed of each component. Directions investigated include exploring different low-level and semantics-based features, testing different compression factors and approximations during video search, and understanding the time vs. accuracy trade-off of reranking. Extensive experiments on data sets consisting of more than 1,000 hours of video showed that through a combination of effective features, highly compressed representations, and one iteration of reranking, our proposed system can achieve a 10,000-fold speedup while retaining 80% accuracy of a state-of-the-art CBVS system. We further performed search over 1 million videos and demonstrated that our system can complete the search in 0.975 seconds with a single core, which potentially opens the door to interactive web-scale CBVS for the general public.
Zhang, L, Yang, Y & Zimmermann, R 2015, 'Fine-grained image categorization by localizing tiny object parts from unannotated images', Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, Annual ACM International Conference on Multimedia Retrieval (ICMR), ACM, Shanghai, China, pp. 107-114.View/Download from: UTS OPUS or Publisher's site
This paper proposes a novel fine-grained image categorization model where no object annotation is required in the training/testing stage. The key technique is a dense graph mining algorithm that localizes multi-scale discriminative object parts in each image. In particular, to mimic the human hierarchical perception mechanism, a super-pixel pyramid is generated for each image, based on which graphlets from each layer are constructed to seamlessly describe object parts. We observe that graphlets representative of each category are densely distributed in the feature space. Therefore, a dense graph mining algorithm is developed to discover graphlets representative of each sub-/super-category. Finally, the discovered graphlets from pairwise images are encoded into an image kernel for fine-grained recognition. Experiments on UCB-200 show that our method performs competitively with many models relying on the annotated bird parts.
Xu, Z, Tsang, W, Yang, Y, Ma, Z & Hauptmann, AG 2014, 'Event Detection using Multi-Level Relevance Labels and Multiple Features', 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, pp. 97-104.View/Download from: UTS OPUS or Publisher's site
We address the challenging problem of utilizing related exemplars for complex event detection when multiple features are available. Related exemplars share certain positive elements of the event, but have no uniform pattern due to the huge variance of relevance levels among different related exemplars. None of the existing multiple feature fusion methods can deal with the related exemplars. In this paper, we propose an algorithm which adaptively utilizes the related exemplars by cross-feature learning. Ordinal labels are used to represent the multiple relevance levels of the related videos. Label candidates of related exemplars are generated by exploring the possible relevance levels of each related exemplar via a cross-feature voting strategy. A maximum margin criterion is then applied in our framework to discriminate the positive and negative exemplars, as well as the related exemplars from different relevance levels. We test our algorithm using the large scale TRECVID 2011 dataset and it achieves promising performance.
Ballas, N, Yang, Y, Lan, ZZ, Delezoide, B, Preteux, F & Hauptmann, A 2014, 'Space-time robust representation for action recognition', Proceedings of the IEEE International Conference on Computer Vision, IEEE International Conference on Computer Vision, IEEE, Sydney, NSW, Australia, pp. 2704-2711.View/Download from: UTS OPUS or Publisher's site
We address the problem of action recognition in unconstrained videos. We propose a novel content driven pooling that leverages space-time context while being robust toward global space-time transformations. Being robust to such transformations is of primary importance in unconstrained videos where the action localizations can drastically shift between frames. Our pooling identifies regions of interest using video structural cues estimated by different saliency functions. To combine the different structural information, we introduce an iterative structure learning algorithm, WSVM (weighted SVM), that determines the optimal saliency layout of an action model through a sparse regularizer. A new optimization method is proposed to solve the WSVM's highly non-smooth objective function. We evaluate our approach on standard action datasets (KTH, UCF50 and HMDB). Most noticeably, the accuracy of our algorithm reaches 51.8% on the challenging HMDB dataset, which outperforms the state of the art by 7.3% relatively. © 2013 IEEE.
Chang, Nie, Yang, Y & Huang 2014, 'A Convex Formulation for Semi-supervised Multi-Label Feature Selection', Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Canada.View/Download from: UTS OPUS
Gao, C, Yang, Y, Liu, G, Meng, D, Cai, Y, Xu, S, Tong, W, Shen, H & Hauptmann, AG 2014, 'Interactive surveillance event detection through mid-level discriminative representation', Proceedings of the ACM International Conference on Multimedia Retrieval 2014, ACM International Conference on Multimedia Retrieval, ACM, Scotland, pp. 305-312.View/Download from: UTS OPUS or Publisher's site
Event detection from real surveillance videos with complicated background environments is always a very hard task. Different from the traditional retrospective and interactive systems designed for this task, which are mainly executed on video fragments located within the event-occurrence time, in this paper we propose a new interactive system constructed on the mid-level discriminative representations (patches/shots) which are closely related to the event (and might occur beyond the event-occurrence period) and are easier to detect than video fragments. By virtue of such easily distinguished mid-level patterns, our framework realizes an effective division of labor between computers and human participants. The task of computers is to train classifiers on a set of mid-level discriminative representations, and to sort all the possible mid-level representations in the evaluation sets based on the classifier scores. The task of human participants is then to readily search the events based on the clues offered by these sorted mid-level representations. For computers, such mid-level representations, with more concise and consistent patterns, can be more accurately detected than the video fragments utilized in the conventional framework; on the other hand, a human participant can search the events of interest implicated by these location-anchored mid-level representations much more easily than with conventional video fragments containing entire scenes. Both of these properties facilitate the use of our framework in real surveillance event detection applications. Copyright is held by the owner/author(s).
Jiang, L, Miao, Y, Yang, Y, Lan, Z & Hauptmann, AG 2014, 'Viral video style: A closer look at viral videos on YouTube', Proceedings of the ACM International Conference on Multimedia Retrieval 2014, ACM International Conference on Multimedia Retrieval, ACM, Scotland, pp. 193-200.View/Download from: UTS OPUS or Publisher's site
Viral videos that gain popularity through the process of Internet sharing are having a profound impact on society. Existing studies on viral videos have only been on small or confidential datasets. We collect by far the largest open benchmark for viral video study called CMU Viral Video Dataset, and share it with researchers from both academia and industry. Having verified existing observations on the dataset, we discover some interesting characteristics of viral videos. Based on our analysis, in the second half of the paper, we propose a model to forecast the future peak day of viral videos. The application of our work is not only important for advertising agencies to plan advertising campaigns and estimate costs, but also for companies to be able to quickly respond to rivals in viral marketing campaigns. The proposed method is unique in that it is the first attempt to incorporate video metadata into the peak day prediction. The empirical results demonstrate that the proposed method outperforms the state-of-the-art methods, with statistically significant differences. Copyright 2014 ACM.
Lan, ZZ, Yang, Y, Ballas, N, Yu, SI & Hauptmann, A 2014, 'Resource constrained multimedia event detection', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), International Conference on Multimedia Modeling, pp. 388-399.View/Download from: Publisher's site
We present a study comparing the cost and efficiency tradeoffs of multiple features for multimedia event detection. Low-level as well as semantic features are a critical part of contemporary multimedia and computer vision research. Arguably, combinations of multiple feature sets have been a major reason for recent progress in the field, not just as low-dimensional representations of multimedia data, but also as a means to semantically summarize images and videos. However, their efficacy for complex event recognition in unconstrained videos on standardized datasets has not been systematically studied. In this paper, we evaluate the accuracy and contribution of more than 10 multi-modality features, including semantic and low-level video representations, using two newly released NIST TRECVID Multimedia Event Detection (MED) open source datasets, i.e. MEDTEST and KINDREDTEST, which contain more than 1000 hours of videos. Contrasting multiple performance metrics, such as average precision, probability of missed detection and minimum normalized detection cost, we propose a framework to balance the trade-off between accuracy and computational cost. This study provides an empirical foundation for selecting feature sets that are capable of dealing with large-scale data with limited computational resources and are likely to produce superior multimedia event detection accuracy. This framework also applies to other resource-limited multimedia analyses such as selecting/fusing multiple classifiers and different representations of each feature set. © 2014 Springer International Publishing.
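Two of the metrics named in the abstract above, average precision and probability of missed detection, can be illustrated with a minimal Python sketch. It uses the common non-interpolated definition of average precision and a fixed decision threshold; the function names, the toy scores, and the threshold are assumptions made for illustration, and the exact NIST evaluation protocol may differ.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated average precision of the ranking induced by scores."""
    # Rank videos from the highest to the lowest detection score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, index in enumerate(order, start=1):
        if labels[index] == 1:
            hits += 1
            # Precision measured at the rank of each positive video.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def missed_detection_rate(scores: Sequence[float], labels: Sequence[int],
                          threshold: float) -> float:
    """Fraction of positive videos whose score falls below the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        return 0.0
    missed = sum(1 for s in positives if s < threshold)
    return missed / len(positives)


# Toy example (hypothetical scores): two positives ranked 1st and 3rd.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)                # (1/1 + 2/3) / 2
p_miss = missed_detection_rate(scores, labels, 0.5)   # one of two positives missed
```

Average precision summarizes the whole ranking, while the missed detection rate depends on the chosen operating threshold, which is one reason to compare several metrics side by side.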
Li, Li, Wang, Yang, Y, Zhang & Zhou 2014, 'Overcoming Semantic Drift in Information Extraction', Proc. 17th International Conference on Extending Database Technology (EDBT), Extending Database Technology, ACM, Greece.View/Download from: UTS OPUS or Publisher's site
Ma, Yang, Y, Sebe & Hauptmann 2014, 'Multiple Features But Few Labels? A Symbiotic Solution Exemplified for Video Analysis', MM '14 Proceedings of the 22nd ACM international conference on Multimedia, ACM International Conference on Multimedia, ACM, Orlando, FL, USA.View/Download from: UTS OPUS or Publisher's site
Peng, Meng, Xu, Gao, Yang, Y & Zhang 2014, 'Decomposable Nonlocal Tensor Dictionary Learning for Multispectral Image Denoising', 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, Ohio, USA.View/Download from: UTS OPUS or Publisher's site
Compared to conventional RGB or gray-scale images, multispectral images (MSI) can deliver a more faithful representation of real scenes, and enhance the performance of many computer vision tasks. In practice, however, an MSI is always corrupted by various kinds of noise. In this paper we propose an effective MSI denoising approach by jointly considering two intrinsic characteristics underlying an MSI: the nonlocal similarity over space and the global correlation across spectrum. Specifically, by explicitly considering the spatial self-similarity of an MSI we construct a nonlocal tensor dictionary learning model with a group-block-sparsity constraint, which makes similar full-band patches (FBP) share the same atoms from the spatial and spectral dictionaries. Furthermore, by exploiting the spectral correlation of an MSI and assuming over-redundancy of dictionaries, the constrained nonlocal MSI dictionary learning model can be decomposed into a series of unconstrained low-rank tensor approximation problems, which can be readily solved by off-the-shelf higher order statistics. Experimental results show that our method outperforms all state-of-the-art MSI denoising methods under comprehensive quantitative performance measures.
Xu, Ye, Li, Liu, Yang, Y & Ding 2014, 'Dynamic Background Learning through Deep Auto-encoder Networks', MM '14 Proceedings of the 22nd ACM international conference on Multimedia, ACM International Conference on Multimedia, ACM, Orlando, FL, USA.View/Download from: UTS OPUS or Publisher's site
Xu, Z, Yang, Y, Kassim, A & Yan, S 2014, 'Cross-media relevance mining for evaluating text-based image search engine', 2014 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2014, IEEE International Conference on Multimedia and Expo Workshops, IEEE, Chengdu, China.View/Download from: UTS OPUS or Publisher's site
© 2014 IEEE. Targeting the MSR-Bing Image Retrieval grand challenge, we build on our experience from the 2013 challenge and further investigate different models to solve the relevance assessment problem. Generally speaking, assessing the relevance between a text query and an image list is very hard due to the semantic gap. It is not easy to find the 'mapping' from a text query into the visual world. Solutions from the 2013 MSR-Bing grand challenge are discussed in this paper. Combining them with our own observations, we give some insights into why some solutions work well while others do not. Our main solution is to combine deep learning features with the winning solution of last year (average similarity measurement and page rank), since the deep learning features have a more compact representation than the traditional BoW features, and deep learning features are efficient (on a decent GPU) with very good performance. Our solution achieved 1st place in the MSR-Bing grand challenge 2014. Finally, we give the running time of our solution in the testing phase for the 2014 ICME testing set and development set, respectively.
Yang, Y, Ma, Z, Xu, Z, Yan, S & Hauptmann, AG 2014, 'How related exemplars help complex event detection in web videos?', Proceedings of the IEEE International Conference on Computer Vision, IEEE International Conference on Computer Vision, IEEE, Sydney, NSW, Australia, pp. 2104-2111.View/Download from: UTS OPUS or Publisher's site
Compared to visual concepts such as actions, scenes and objects, a complex event is a higher-level abstraction of longer video sequences. For example, a 'marriage proposal' event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). Positive exemplars which exactly convey the precise semantics of an event are hard to obtain. It would be beneficial to utilize related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially, as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to the 'attempting a bike trick' event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and that it achieves good performance for complex event detection. © 2013 IEEE.
Yang, Y, Shen, Yu, Meng & Hauptmann 2014, 'Unsupervised Video Adaptation for Parsing Human Motion', Lecture Notes in Computer Science - Computer Vision – ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, European Conference on Computer Vision, Springer, Switzerland.View/Download from: UTS OPUS or Publisher's site
Yu, Z, Wu, F, Yang, Y, Tian, Q, Luo, J & Zhuang, Y 2014, 'Discriminative coupled dictionary hashing for fast cross-media retrieval', SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM International Conference on Research and Development in Information Retrieval, ACM, Australia, pp. 395-404.View/Download from: UTS OPUS or Publisher's site
Cross-media hashing, which conducts cross-media retrieval by embedding data from different modalities into a common low-dimensional Hamming space, has attracted intensive attention in recent years. The existing cross-media hashing approaches only aim at learning hash functions to preserve the intra-modality and inter-modality correlations, but do not directly capture the underlying semantic information of the multi-modal data. We propose a discriminative coupled dictionary hashing (DCDH) method in this paper. In DCDH, the coupled dictionary for each modality is learned with side information (e.g., categories). As a result, the coupled dictionaries not only preserve the intra-similarity and inter-correlation among multi-modal data, but also contain dictionary atoms that are semantically discriminative (i.e., data from the same category are reconstructed by similar dictionary atoms). To perform fast cross-media retrieval, we learn hash functions which map data from the dictionary space to a low-dimensional Hamming space. In addition, we conjecture that a balanced representation is crucial in cross-media retrieval. We introduce multi-view features on the relatively "weak" modalities into DCDH and extend it to multi-view DCDH (MV-DCDH) in order to enhance their representation capability. The experiments on two real-world data sets show that our DCDH and MV-DCDH outperform the state-of-the-art methods significantly on cross-media retrieval. Copyright 2014 ACM.
Yu, Z, Zhang, Y, Tang, S, Yang, Y, Tian, Q & Luo, J 2014, 'CROSS-MEDIA HASHING WITH KERNEL REGRESSION', 2014 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), IEEE International Conference on Multimedia and Expo Workshops (ICMEW), IEEE, Chengdu, PEOPLES R CHINA.
Zhang, L, Yang, Y & Zimmermann, R 2014, 'Discriminative cellets discovery for fine-grained image categories retrieval', Proceedings of the ACM International Conference on Multimedia Retrieval 2014, ACM International Conference on Multimedia Retrieval, ACM, Scotland, pp. 57-64.View/Download from: UTS OPUS or Publisher's site
Fine-grained image category recognition is a challenging task that aims at distinguishing objects belonging to the same basic-level category, such as leaf or mushroom. It is a useful technique that can be applied to species recognition, face verification, etc. Most of the existing methods have difficulty automatically detecting discriminative object components. In this paper, we propose a new fine-grained image categorization model that can be deemed an improved version of spatial pyramid matching (SPM). Instead of the conventional SPM that enumeratively conducts cell-to-cell matching between images, the proposed model combines multiple cells into cellets that are highly responsive to object fine-grained categories. In particular, we describe object components by cellets that connect spatially adjacent cells from the same pyramid level. Image categorization can then be cast as the matching between cellets extracted from pairwise images. Toward an effective matching process, a hierarchical sparse coding algorithm is derived that represents each cellet by a linear combination of the basis cellets. Further, a linear discriminant analysis (LDA)-like scheme is employed to select the cellets with high discrimination. On the basis of the feature vector built from the selected cellets, fine-grained image categorization is conducted by training a linear SVM. Experimental results on the Caltech-UCSD birds, the Leeds butterflies, and the COSMIC insects data sets demonstrate that our model outperforms the state of the art. In addition, the visualized cellets show that discriminative object parts are localized accurately. Copyright 2014 ACM.
Xu, Z, Yang, Y, Tsang, I, Hauptmann, A & Sebe, N 2013, 'Feature Weighting via Optimal Thresholding for Video Analysis', Proceedings of the 2013 IEEE International Conference on Computer Vision, IEEE International Conference on Computer Vision, IEEE, Sydney, NSW, Australia, pp. 3340-3447.View/Download from: UTS OPUS or Publisher's site
Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks like the TRECVID Multimedia Event Detection (MED) competition. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT), to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of the fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on the HMDB action recognition dataset and the CCV video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.
Cai, Y, Yang, Y, Hauptmann, AG & Wactlar, HD 2013, 'A cognitive assistive system for monitoring the use of home medical devices', MIIRH 2013 - Proceedings of the 1st ACM International Workshop on Multimedia Indexing and Information Retrieval for Healthcare, Co-located with ACM Multimedia 2013, ACM International Conference on Multimedia, ACM, Barcelona, Spain, pp. 59-66.View/Download from: UTS OPUS or Publisher's site
Despite the popularity of home medical devices, serious safety concerns have been raised, because use errors of home medical devices have been linked to a large number of fatal hazards. To resolve the problem, we introduce a cognitive assistive system to automatically monitor the use of home medical devices. Being able to accurately recognize user operations is one of the most important functionalities of the proposed system. However, even though various action recognition algorithms have been proposed in recent years, it is still unknown whether they are adequate for recognizing operations in using home medical devices. Since the lack of a corresponding database is the main reason for this situation, in the first part of this paper we present a database specially designed for studying the use of home medical devices. Then, we evaluate the performance of the existing approaches on the proposed database. Although we use state-of-the-art approaches which have demonstrated near perfect performance in recognizing certain general human actions, we observe a significant performance drop when applying them to recognize device operations. We conclude that the tiny actions involved in using devices are one of the most important reasons for the performance decrease. To accurately recognize tiny actions, it is critical to focus on where the target action happens, namely the region of interest (ROI), and to have more elaborate action modeling based on the ROI. Therefore, in the second part of this paper, we introduce a simple but effective approach to estimating the ROI for recognizing tiny actions. The key idea of this method is to analyze the correlation between an action and the sub-regions of a frame. The estimated ROI is then used as a filter for building more accurate action representations. Experimental results show significant performance improvements over the baseline methods by using the estimated ROI for action recognition. © 2013 ACM.
Cao, X, Wei, X, Han, Y, Yang, Y & Lin, D 2013, 'Robust tensor clustering with non-greedy maximization', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press, Beijing China, pp. 1254-1259.View/Download from: UTS OPUS
Tensors are increasingly common in several areas such as data mining, computer graphics, and computer vision. Tensor clustering is a fundamental tool for data analysis and pattern discovery. However, there usually exist outlying data points in real-world datasets, which reduce the performance of clustering. This motivates us to develop a tensor clustering algorithm that is robust to outliers. In this paper, we propose an algorithm for Robust Tensor Clustering (RTC). RTC first finds a lower-rank approximation of the original tensor data using an L1 norm optimization function. Because the L1 norm does not exaggerate the effect of outliers compared with the L2 norm, the minimization of the L1 norm approximation function makes RTC robust to outliers. Then we compute the HOSVD decomposition of this approximate tensor to obtain the final clustering results. Different from the traditional algorithm, which solves the approximation function with a greedy strategy, we utilize a non-greedy strategy to obtain a better solution. Experiments demonstrate that RTC has better performance than the state-of-the-art algorithms and is more robust to outliers.
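The claim in the abstract above that an L1 criterion is less sensitive to outliers than an L2 criterion can be seen in the simplest possible setting: fitting a single number to a list of values. The minimizer of the sum of squared residuals is the mean, and a minimizer of the sum of absolute residuals is the median. The Python sketch below only illustrates that general fact; it is not the RTC algorithm, and the function names and data are hypothetical.

```python
import statistics
from typing import Sequence


def l2_fit(values: Sequence[float]) -> float:
    """Scalar minimizing the sum of squared residuals (the mean)."""
    return statistics.mean(values)


def l1_fit(values: Sequence[float]) -> float:
    """Scalar minimizing the sum of absolute residuals (the median)."""
    return statistics.median(values)


def l1_cost(values: Sequence[float], center: float) -> float:
    """Sum of absolute residuals around a candidate center."""
    return sum(abs(v - center) for v in values)


# Hypothetical data: four inliers and one gross outlier.
data = [1, 2, 3, 4, 100]
print(l2_fit(data))  # 22: dragged far from the inliers by the outlier
print(l1_fit(data))  # 3: stays among the inliers
```

Squaring a residual of size 97 contributes far more to the L2 objective than its absolute value contributes to the L1 objective, which is why the L2 fit moves so much further toward the outlier.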
Chen, MY, Hauptmann, A, Bharucha, A, Wactlar, H & Yang, Y 2011, 'Human activity analysis for geriatric care in nursing homes', The Era of Interactive Media, Pacific-Rim Conference on Multimedia, Springer, Sydney, Australia, pp. 53-61.View/Download from: UTS OPUS or Publisher's site
© 2013 Springer Science+Business Media, LLC. All rights reserved. As our society is increasingly aging, it is urgent to develop computer-aided techniques to improve the quality-of-care (QoC) and quality-of-life (QoL) of geriatric patients. In this paper, we focus on automatic human activity analysis in video surveillance recorded in complicated environments at a nursing home. This will enable the automatic exploration of the statistical patterns between patients' daily activities and their clinical diagnoses. We also discuss potential future research directions in this area. Experiments demonstrate that the proposed approach is effective for human activity analysis.
Han, Yang, Y & Zhou 2013, 'Co-Regularized Ensemble for Feature Selection', Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 13), International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, Beijing, China.View/Download from: UTS OPUS
Ma, Yang, Y, Nie & Sebe 2013, 'Thinking of Images as What They Are: Compound Matrix Regression for Image Classification', Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 13), International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, Beijing, China.View/Download from: UTS OPUS
Ma, Yang, Y, Xu, Sebe & Hauptmann 2013, 'We Are Not Equally Negative: Fine-grained Labeling for Multimedia Event Detection', MM '13 Proceedings of the 21st ACM international conference on Multimedia, ACM International Conference on Multimedia, ACM, Barcelona, Spain.View/Download from: UTS OPUS or Publisher's site
Ma, Yang, Y, Xu, Sebe, Yan & Hauptmann 2013, 'Complex Event Detection via Multi-Source Video Attributes', 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Portland, OR, USA.View/Download from: UTS OPUS or Publisher's site
Song, J, Yang, Y, Yang, Y, Huang, Z & Shen, HT 2013, 'Inter-media hashing for large-scale retrieval from heterogeneous data sources', Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Special Interest Group on Management of Data Conference, ACM, New York, New York, USA, pp. 785-796.View/Download from: UTS OPUS or Publisher's site
In this paper, we present a new multimedia retrieval paradigm to innovate large-scale search of heterogeneous multimedia data. It is able to return results of different media types from heterogeneous data sources, e.g., using a query image to retrieve relevant text documents or images from different data sources. This utilizes the widely available data from different sources and caters for current users' demand for a result list that simultaneously contains multiple types of data, giving a comprehensive understanding of the query's results. To enable large-scale inter-media retrieval, we propose a novel inter-media hashing (IMH) model to explore the correlations among multiple media types from different data sources and tackle the scalability issue. To this end, multimedia data from heterogeneous data sources are transformed into a common Hamming space, in which fast search can be easily implemented by XOR and bit-count operations. Furthermore, we integrate a linear regression model to learn hashing functions so that the hash codes for new data points can be efficiently generated. Experiments conducted on real-world large-scale multimedia datasets demonstrate the superiority of our proposed method compared with state-of-the-art techniques. Copyright © 2013 ACM.
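The search step mentioned in the abstract above, comparing binary codes in Hamming space with XOR and bit-count operations, can be sketched in a few lines of Python. The sketch assumes each item's hash code is already available and stored as an integer; the item identifiers and codes are hypothetical, and how IMH learns the codes is not shown.

```python
from typing import Dict, List, Tuple


def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two binary codes stored as ints."""
    # XOR leaves a 1 exactly where the codes disagree; count those bits.
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query: int, database: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (item id, distance) pairs, nearest codes first."""
    scored = [(item_id, hamming_distance(query, code))
              for item_id, code in database.items()]
    # Ties are broken by item id so the ordering is deterministic.
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))


# Hypothetical 4-bit codes for items of different media types.
database = {"image_17": 0b1010, "text_03": 0b1000, "video_42": 0b0101}
print(rank_by_hamming(0b1010, database))
# [('image_17', 0), ('text_03', 1), ('video_42', 4)]
```

Because every media type is mapped into the same code space, one query code can rank images, text and video together, and each comparison costs only an XOR plus a bit count.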
Wang, S, Xu, Z, Yang, Y, Li, X, Pang, C & Hauptmann, AG 2013, 'Fall detection in multi-camera surveillance videos: Experimentations and observations', MIIRH 2013 - Proceedings of the 1st ACM International Workshop on Multimedia Indexing and Information Retrieval for Healthcare, Co-located with ACM Multimedia 2013, pp. 33-38.View/Download from: Publisher's site
This paper presents our study on fall detection for ageing care monitoring. We collected a choreographed multi-camera dataset that contains fall actions and other actions such as walking, standing up, sitting down and so forth. In our work, MoSIFT features are extracted from the videos recorded by each camera. We conduct a series of experiments to show the performance variations of fall detection when different methods are used. We first compare the performance of the standard Bag-of-Words and spatial Bag-of-Words with different codebook sizes. Then, we test different fusion methods which combine the information from the videos recorded by two orthogonally deployed cameras, where a non-linear χ² kernel Support Vector Machine (SVM) is trained to detect fall actions. In addition, we also use explicit feature maps along with a linear kernel for fall detection and compare this to the standard Bag-of-Words representation with a non-linear χ² kernel. Our experimental results show that late fusion of Bag-of-Words with a 1000-center codebook obtains the best performance. The best result reaches 90.46% in average precision, which in turn may provide a more independent and safer living environment for the elderly. © 2013 ACM.
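Two ingredients named in the abstract above, a χ² kernel on Bag-of-Words histograms and late fusion of per-camera detector outputs, are illustrated in the Python sketch below. It uses the widely used exponential form of the χ² kernel and a simple weighted average of scores; the bandwidth `gamma`, the fusion `weight`, and all function names are assumptions, since the abstract does not state which kernel variant or fusion rule was used.

```python
import math
from typing import List, Sequence


def chi2_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """Exponential chi-square kernel between two non-negative histograms."""
    distance = 0.0
    for xi, yi in zip(x, y):
        if xi + yi > 0:  # skip bins that are empty in both histograms
            distance += (xi - yi) ** 2 / (xi + yi)
    return math.exp(-gamma * distance)


def late_fusion(scores_cam1: Sequence[float], scores_cam2: Sequence[float],
                weight: float = 0.5) -> List[float]:
    """Combine per-video detector scores from two cameras by weighted averaging."""
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(scores_cam1, scores_cam2)]


# Identical histograms give the maximum kernel value of 1.0.
same = chi2_kernel([0.5, 0.5], [0.5, 0.5])
# Histograms with no overlapping mass are far apart: exp(-2) for gamma = 1.
disjoint = chi2_kernel([1.0, 0.0], [0.0, 1.0])
# Hypothetical per-camera scores for two videos.
fused = late_fusion([1.0, 0.0], [0.0, 1.0])
```

In late fusion each camera has its own classifier and only the output scores are merged, in contrast to early fusion, which would concatenate the two cameras' features before training a single classifier.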
Wu, F, Tan, X, Yang, Y, Tao, D, Tang, S & Zhuang, Y 2013, 'Supervised Nonnegative Tensor Factorization with Maximum-Margin Constraint', Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Bellevue, Washington, USA, pp. 962-968.View/Download from: UTS OPUS
Non-negative tensor factorization (NTF) has attracted great attention in the machine learning community. In this paper, we extend traditional non-negative tensor factorization into a supervised discriminative decomposition, referred to as Supervised Non-negative Tensor Factorization with Maximum-Margin Constraint (SNTFM2). SNTFM2 formulates the optimal discriminative factorization of non-negative tensorial data as a coupled least-squares optimization problem via a maximum-margin method. As a result, SNTFM2 not only faithfully approximates the tensorial data by additive combinations of the basis, but also obtains strong generalization power for discriminative analysis (in particular for classification in this paper). The experimental results show the superiority of our proposed model over state-of-the-art techniques on both toy and real-world data sets.
Yu, Yang, Y & Hauptmann 2013, 'Harry Potter's Marauder's Map: Localizing and Tracking Multiple Persons-of-Interest by Nonnegative Discretization', 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Portland, USA.View/Download from: UTS OPUS or Publisher's site
Zheng, Shang, Yuan & Yang, Y 2013, 'Towards Efficient Search for Activity Trajectories', 2013 IEEE 29TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), IEEE International Conference on Data Engineering, IEEE, Brisbane, QLD, Australia.View/Download from: UTS OPUS or Publisher's site
The advances in location positioning and wireless communication technologies have led to a myriad of spatial trajectories representing the mobility of a variety of moving objects. While processing trajectory data with a focus on spatio-temporal features has been widely studied in the last decade, the recent proliferation of location-based web applications (e.g., Foursquare, Facebook) has given rise to large amounts of trajectories associated with activity information, called activity trajectories. In this paper, we study the problem of efficient similarity search on activity trajectory databases. Given a sequence of query locations, each associated with a set of desired activities, an activity trajectory similarity query (ATSQ) returns k trajectories that cover the query activities and yield the shortest minimum match distance. An order-sensitive activity trajectory similarity query (OATSQ) is also proposed to take into account the order of the query locations. To process the queries efficiently, we first develop a novel hybrid grid index, GAT, to organize the trajectory segments and activities hierarchically, which enables us to prune the search space by location proximity and activity containment simultaneously. In addition, we propose algorithms for efficient computation of the minimum match distance and minimum order-sensitive match distance, respectively. The results of our extensive empirical studies based on real online check-in datasets demonstrate that our proposed index and methods are capable of achieving superior performance and good scalability.
Cao, L, Ji, R, Gao, Y, Yang, Y & Tian, Q 2012, 'Weakly supervised sparse coding with geometric consistency pooling', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Providence, RI, USA, pp. 3578-3585.View/Download from: UTS OPUS or Publisher's site
Most recently the Bag-of-Features (BoF) representation has been well advocated for image search and classification, with two decent phases named sparse coding and max pooling to compensate for quantization loss as well as inject spatial layouts. But still, much information has been discarded by quantizing local descriptors with two-dimensional layouts into a one-dimensional BoF histogram. In this paper, we revisit this popular sparse coding and max pooling paradigm by looking around the local descriptor context towards an optimal BoF. First, we introduce Weakly supervised Sparse Coding (WSC) to exploit Classemes-based attribute labeling to refine the descriptor coding procedure. It is achieved by learning an attribute-to-word co-occurrence prior to impose a label inconsistency distortion over the ℓ1-based coding regularizer, such that the descriptor codes can maximally preserve the image semantic similarity. Second, we propose an adaptive feature pooling scheme over superpixels rather than over fixed spatial pyramids, named Geometric Consistency Pooling (GCP). As a result, local descriptors enjoying good geometric consistency are pooled together to ensure a more precise spatial layout embedding in BoF. Both of our phases are unsupervised, which differs from existing works in supervised dictionary learning, sparse coding and feature pooling. Therefore, our approach enables potential applications like scalable visual search. We evaluate on both image classification and search benchmarks and report good improvements over the state of the art. © 2012 IEEE.
Li, Z, Yang, Y, Liu, J, Zhou, X & Lu, H 2012, 'Unsupervised feature selection using nonnegative spectral analysis', Proceedings of the National Conference on Artificial Intelligence, pp. 1026-1032.View/Download from: UTS OPUS
In this paper, a new unsupervised learning algorithm, namely Nonnegative Discriminative Feature Selection (NDFS), is proposed. To exploit the discriminative information in unsupervised scenarios, we perform spectral clustering to learn the cluster labels of the input samples, during which the feature selection is performed simultaneously. The joint learning of the cluster labels and feature selection matrix enables NDFS to select the most discriminative features. To learn more accurate cluster labels, a nonnegative constraint is explicitly imposed on the class indicators. To reduce redundant or even noisy features, an ℓ2,1-norm minimization constraint is added to the objective function, which guarantees that the feature selection matrix is sparse in rows. Our algorithm exploits the discriminative information and feature correlation simultaneously to select a better feature subset. A simple yet efficient iterative algorithm is designed to optimize the proposed objective function. Experimental results on different real-world datasets demonstrate the encouraging performance of our algorithm over the state of the art. Copyright © 2012, Association for the Advancement of Artificial Intelligence. All rights reserved.
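The ℓ2,1-norm mentioned in the abstract above is the sum of the Euclidean norms of a matrix's rows, so minimizing it pushes entire rows toward zero. The Python sketch below computes the norm and ranks features by row norm for a small hand-written matrix. Ranking features by the row norms of the learned matrix is a common convention for this family of methods and is an assumption here; the matrix `W` is hypothetical, and the optimization that would produce it is not shown.

```python
import math
from typing import List, Sequence


def row_norms(W: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean (l2) norm of each row; one row per feature."""
    return [math.sqrt(sum(v * v for v in row)) for row in W]


def l21_norm(W: Sequence[Sequence[float]]) -> float:
    """l2,1-norm: the sum (an l1 norm) of the rows' l2 norms."""
    return sum(row_norms(W))


def rank_features(W: Sequence[Sequence[float]]) -> List[int]:
    """Feature indices ordered from the largest to the smallest row norm."""
    norms = row_norms(W)
    return sorted(range(len(norms)), key=lambda i: -norms[i])


# Hypothetical 3-feature, 2-cluster selection matrix; feature 1 is zeroed out.
W = [[3.0, 4.0],
     [0.0, 0.0],
     [1.0, 0.0]]
print(l21_norm(W))       # 6.0  (row norms 5, 0 and 1)
print(rank_features(W))  # [0, 2, 1]
```

A row that is entirely zero means the corresponding feature contributes to no cluster indicator, which is the sense in which row sparsity performs feature selection.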
Ma, Z, Yang, Y, Cai, Y, Sebe, N & Hauptmann, AG 2012, 'Knowledge adaptation for ad hoc multimedia event detection with few exemplars', MM 2012 - Proceedings of the 20th ACM International Conference on Multimedia, ACM International Conference on Multimedia, IEEE, Institute of Electrical and Electronics Engineers, Nara, Japan, pp. 469-478.View/Download from: UTS OPUS or Publisher's site
Multimedia event detection (MED) has a significant impact on many applications. Though video concept annotation has received much research effort, video event detection remains largely unaddressed. Current research mainly focuses on sports and news event detection or abnormality detection in surveillance videos. Our research on this topic is capable of detecting more complicated and generic events. Moreover, the curse of reality, i.e., precisely labeled multimedia content is scarce, necessitates the study of how to attain respectable detection performance using only limited positive examples. Research addressing these two aforementioned issues is still in its infancy. In light of this, we explore Ad Hoc MED, which aims to detect complicated and generic events by using few positive examples. To the best of our knowledge, our work makes the first attempt on this topic. As the information from these few positive examples is limited, we propose to infer knowledge from other multimedia resources to facilitate event detection. Experiments are performed on real-world multimedia archives consisting of several challenging events. The results show that our approach outperforms several other detection algorithms. Most notably, our algorithm outperforms SVM by 43% and 14% comparatively in Average Precision when using Gaussian and χ² kernels, respectively. © 2012 ACM.
Ma, Z, Yang, Y, Hauptmann, AG & Sebe, N 2012, 'Classifier-specific intermediate representation for multimedia tasks', Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ICMR 2012, ACM International Conference on Multimedia Retrieval, Association for Computing Machinery (ACM), Hong Kong, Hong Kong.View/Download from: UTS OPUS or Publisher's site
Video annotation and multimedia classification play important roles in many applications such as video indexing and retrieval. To improve video annotation and event detection, researchers have proposed using intermediate concept classifiers with concept lexica to help understand the videos. Yet it is difficult to judge how many and what concepts would be sufficient for the particular video analysis task. Additionally, obtaining robust semantic concept classifiers requires a large number of positive training examples, which in turn has high human annotation cost. In this paper, we propose an approach that is able to automatically learn an intermediate representation from video features together with a classifier. The joint optimization of the two components makes them mutually beneficial and reciprocal. Effectively, the intermediate representation and the classifier are tightly correlated. The classifier dependent intermediate representation not only accurately reflects the task semantics but is also more suitable for the specific classifier. Thus we have created a discriminative semantic analysis framework based on a tightly-coupled intermediate representation. Several experiments on video annotation and multimedia event detection using real-world videos demonstrate the effectiveness of the proposed approach. Copyright © 2012 ACM.
Wang, S, Yang, Y, Ma, Z, Li, X, Pang, C & Hauptmann, AG 2012, 'Action recognition by exploring data distribution and feature correlation', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Providence, RI, USA, pp. 1370-1377.
Human action recognition in videos draws strong research interest in computer vision because of its promising applications for video surveillance, video annotation, interactive gaming, etc. However, the amount of video data containing human actions is increasing exponentially, which makes the management of these resources a challenging task. Given a database with huge volumes of unlabeled videos, it is prohibitive to manually assign specific action types to these videos. Considering that it is much easier to obtain a small number of labeled videos, a practical solution for organizing them is to build a mechanism which is able to conduct action annotation automatically by leveraging the limited labeled videos. Motivated by this intuition, we propose an automatic video annotation algorithm by integrating semi-supervised learning and shared structure analysis into a joint framework for human action recognition. We apply our algorithm on both synthetic and realistic video datasets, including KTH, the CareMedia dataset, YouTube action and its extended version, UCF50. Extensive experiments demonstrate that the proposed algorithm outperforms the compared algorithms for action recognition. Most notably, our method has a very distinct advantage over other compared algorithms when we have only a few labeled samples. © 2012 IEEE.
Yang, Y & Bao, C 2012, 'A novel discriminant locality preserving projections for MDM-based speaker classification', 3rd Global Congress on Intelligent Systems (GCIS) 2012, WRI Global Congress on Intelligent Systems (GCIS), IEEE, Wuhan, China, pp. 127-130.
Speaker classification is an important component of audio indexing technology for many applications such as multimedia conferencing. The primary input device in the NIST speaker classification evaluation is Multiple Distant Microphones (MDM). MDM is composed of multiple microphones and has the merits of low price and ease of use. The spatial time-delay vector of MDM can be extracted as the speaker's discriminant feature. However, the feature dimension grows quickly as the number of sensors increases. Locality Preserving Projections (LPP) and Discriminant Locality Preserving Projections (DLPP) are the principal manifold dimension-reduction algorithms proposed recently. In this paper, we propose a novel method to overcome the drawbacks of traditional manifold algorithms, such as the lack of class information or spatial identification information. Some basic concepts of the spatial time-delay feature and the merging feature for MDM speaker classification are first introduced. A review of the known DLPP algorithm, followed by the Fisher criterion, is given. Then the Multi-component Discriminant Locality Preserving Projections (MDLPP) method for speaker classification with MDM is described. Comparative experimental results on real meeting data show the effectiveness of the proposed method.
Yang, Y, Hauptmann, A, Chen, MY, Cai, Y, Bharucha, A & Wactlar, H 2012, 'Learning to predict health status of geriatric patients from observational data', 2012 IEEE Symposium on Computational Intelligence and Computational Biology, CIBCB 2012, IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, IEEE, Institute of Electrical and Electronics Engineers, San Diego, CA, USA, pp. 127-134.
Data for diagnosis and clinical studies are now typically gathered by hand. While more detailed, exhaustive behavioral assessment scales have been developed, they have the drawback of being too time consuming, and manual assessment can be subjective. Besides, clinical knowledge is required for accurate manual assessment, for which extensive training is needed. Therefore, our main research challenge is to leverage machine learning techniques to better understand patients' health status automatically based on continuous computer observations. In this paper, we study the problem of health status prediction for geriatric patients using observational data. In the first part of this paper, we propose a distance metric learning algorithm to learn a Mahalanobis distance which is more precise for similarity measures. In the second part, we propose a robust classifier based on ℓ2,1-norm regression to predict the geriatric patients' health status. We test the algorithm on a dataset collected from a nursing home. Experiments show that our algorithm achieves encouraging performance. © 2012 IEEE.
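As an illustration of the Mahalanobis distance mentioned in this abstract, the following is a minimal sketch in plain Python. It only evaluates the distance sqrt((x - y)^T M (x - y)) for a given positive semidefinite matrix M; the matrices below are hand-picked for illustration, and the paper's algorithm for learning M is not reproduced here.

```python
import math


def mahalanobis(x, y, M):
    """Return sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M."""
    d = [a - b for a, b in zip(x, y)]
    # Matrix-vector product M d, written out with plain lists.
    Md = [sum(M[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return math.sqrt(sum(d[i] * Md[i] for i in range(len(d))))


# Hypothetical metrics: the identity recovers the Euclidean distance,
# while the second matrix weights the first feature more heavily.
identity = [[1.0, 0.0], [0.0, 1.0]]
weighted = [[4.0, 0.0], [0.0, 1.0]]

print(mahalanobis([1.0, 0.0], [0.0, 0.0], identity))
print(mahalanobis([1.0, 0.0], [0.0, 0.0], weighted))
```

With the identity matrix the two points are one unit apart; with the weighted matrix the same pair is twice as far apart, which is the kind of rescaling a learned metric can express.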
Yang, Y, Yang, Y, Huang, Z, Liu, J & Ma, Z 2012, 'Robust cross-media transfer for visual event detection', MM 2012 - Proceedings of the 20th ACM International Conference on Multimedia, ACM International Conference on Multimedia, Association for Computing Machinery (ACM), Nara, Japan, pp. 1045-1048.
In this paper, we present a novel approach, named Robust Cross-Media Transfer (RCMT), for visual event detection in social multimedia environments. Different from most existing methods, the proposed method can directly take different types of noisy social multimedia data as input and conduct robust event detection. More specifically, we build a robust model by employing an ℓ2,1-norm regression model featuring noise tolerance, and also manage to integrate different types of social multimedia data by minimizing the distribution difference among them. Experimental results on a real-life Flickr image dataset and a YouTube video dataset demonstrate the effectiveness of our proposal, compared to state-of-the-art algorithms. © 2012 ACM.
Yu, SI, Xu, Z, Ding, D, Sze, W, Vicente, F, Lan, Z, Cai, Y, Rawat, S, Schulam, P, Markandaiah, N, Bahmani, S, Juarez, A, Tong, W, Yang, Y, Burger, S, Metze, F, Singh, R, Raj, B, Stern, R, Mitamura, T, Nyberg, E & Hauptmann, A 2012, 'Informedia e-lamp @ TRECVID 2012 multimedia event detection and recounting MED and MER', 2012 TREC Video Retrieval Evaluation Notebook Papers.
We report on our system used in the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. For MED, it consists of three main steps: extracting features, training detectors and fusion. In the feature extraction part, we extract many low-level, high-level, and text features. Those features are then represented in three different ways: spatial bag-of-words with standard tiling, spatial bag-of-words with feature- and event-specific tiling, and the Gaussian Mixture Model Super Vector. In the detector training and fusion, two classifiers and three fusion methods are employed. The results from both the official sources and our internal evaluations show good performance of our system. Our MER system utilizes a subset of features and detection results from the MED system from which the recounting is generated.
Ma, Z, Yang, Y, Nie, F, Uijlings, J & Sebe, N 2011, 'Exploiting the entire feature space with sparsity for automatic image annotation', MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, ACM International Conference on Multimedia, ACM, Scottsdale, Arizona, USA, pp. 283-292.
The explosive growth of digital images requires effective methods to manage these images. Among various existing methods, automatic image annotation has proved to be an important technique for image management tasks, e.g., image retrieval over large-scale image databases. Automatic image annotation has been widely studied during recent years and a considerable number of approaches have been proposed. However, the performance of these methods is not yet satisfactory, thus demanding more research effort on image annotation. In this paper, we propose a novel semi-supervised framework built upon feature selection for automatic image annotation. Our method aims to jointly select the most relevant features from all the data points by using a sparsity-based model and exploiting both labeled and unlabeled data to learn the manifold structure. Our framework is able to simultaneously learn a robust classifier for image annotation by selecting the discriminating features related to the semantic concepts. To solve the objective function of our framework, we propose an efficient iterative algorithm. Extensive experiments are performed on different real-world image datasets, with the results demonstrating the promising performance of our framework for automatic image annotation. © 2011 ACM.
This paper describes the experimental framework of the University of Queensland's Multimedia Search Group (UQMSG) at TRECVID 2011. We participated in two tasks this year, both for the first time. For the semantic indexing task, we submitted four lite runs: L_A_UQMSG1_1, L_A_UQMSG2_2, L_A_UQMSG3_3 and L_A_UQMSG4_4. They are all of training type A (actually we only used IACC.1.tv10.training data), but with different parameter settings in our keyframe-based Laplacian Joint Group Lasso (LJGL) algorithm with the Local Binary Patterns (LBP) feature. For the content-based copy detection task, we submitted two runs: UQMSG.m.nofa.mfh and UQMSG.m.balanced.mfh. They used only the video modality information of keyframes and were both based on our Multiple Feature Hashing (MFH) algorithm that fuses local (LBP) and global (HSV) visual features, with different application profiles (reducing the false alarm rate vs. balancing false alarms and misses). Due to time constraints, we were not able to improve the performance of our systems adequately on all the available training data this year for these tasks. Evaluation results suggest that more effort is needed to tune the system parameters well. In addition, sophisticated techniques beyond applying keyframe-level semantic concept propagation and near-duplicate detection are required for achieving better performance in video tasks.
Song, J, Yang, Y, Huang, Z, Shen, HT & Hong, R 2011, 'Multiple feature hashing for real-time large scale near-duplicate video retrieval', MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, ACM International Conference on Multimedia, ACM, Scottsdale, Arizona, USA, pp. 423-432.
Near-duplicate video retrieval (NDVR) has recently attracted considerable research attention due to the exponential growth of online videos. It helps in many areas, such as copyright protection, video tagging, online video usage monitoring, etc. Most existing approaches use only a single feature to represent a video for NDVR. However, a single feature is often insufficient to characterize the video content. Besides, while accuracy is the main concern in the previous literature, the scalability of NDVR algorithms for large scale video datasets has rarely been addressed. In this paper, we present a novel approach, Multiple Feature Hashing (MFH), to tackle both the accuracy and the scalability issues of NDVR. MFH preserves the local structure information of each individual feature and also globally considers the local structures for all the features to learn a group of hash functions which map the video keyframes into the Hamming space and generate a series of binary codes to represent the video dataset. We evaluate our approach on a public video dataset and a large scale video dataset consisting of 132,647 videos, which we collected from YouTube. The experimental results show that the proposed method outperforms the state-of-the-art techniques in both accuracy and efficiency. © 2011 ACM.
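To make the idea of mapping keyframes into the Hamming space concrete, here is a minimal sketch in plain Python. It assumes thresholded linear projections as hash functions, with hand-picked (hypothetical) projection weights and thresholds; MFH learns its hash functions from multiple features, which this sketch does not attempt.

```python
def binary_code(feature, projections, thresholds):
    """Map a feature vector to a binary code, one bit per hash function."""
    return tuple(
        1 if sum(w * f for w, f in zip(row, feature)) > t else 0
        for row, t in zip(projections, thresholds)
    )


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical hash functions and keyframe features, for illustration only.
projections = [[1, 0], [0, 1], [1, 1]]
thresholds = [0.5, 0.5, 1.0]

query = binary_code([0.8, 0.1], projections, thresholds)
near_duplicate = binary_code([0.7, 0.2], projections, thresholds)
unrelated = binary_code([0.2, 0.9], projections, thresholds)

print(hamming(query, near_duplicate), hamming(query, unrelated))
```

Similar features fall on the same side of each threshold and so share a code, while the dissimilar feature differs in every bit. Comparing short binary codes with bit counts is what makes this style of retrieval cheap at large scale.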
Wang, H, Nie, F, Huang, H & Yang, Y 2011, 'Learning frame relevance for video classification', MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, ACM International Conference on Multimedia, ACM, Scottsdale, Arizona, USA, pp. 1345-1348.
Traditional video classification methods typically require a large number of labeled training video frames to achieve satisfactory performance. However, in the real world, we usually only have sufficient labeled video clips (such as tagged online videos) but lack labeled video frames. In this paper, we formalize the video classification problem as a Multi-Instance Learning (MIL) problem, an emerging topic in machine learning in recent years, which only needs bag (video clip) labels. To solve the problem, we propose a novel Parameterized Class-to-Bag (P-C2B) Distance method to learn the relative importance of a training instance with respect to its labeled classes, such that the instance level labeling ambiguity in MIL is tackled and the frame relevances of training video data with respect to the semantic concepts of interest are given. Promising experimental results have demonstrated the effectiveness of the proposed method. Copyright 2011 ACM.
Yang, Y, Shen, HT, Ma, Z, Huang, Z & Zhou, X 2011, 'L2,1-Norm Regularized Discriminative Feature Selection for Unsupervised Learning', 22nd IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press, Barcelona, Catalonia, pp. 1589-1594.
Compared with supervised learning for feature selection, it is much more difficult to select the discriminative features in unsupervised learning due to the lack of label information. Traditional unsupervised feature selection algorithms usually select the features which best preserve the data distribution, e.g., manifold structure, of the whole feature set. Under the assumption that the class label of input data can be predicted by a linear classifier, we incorporate discriminative analysis and ℓ2,1-norm minimization into a joint framework for unsupervised feature selection. Different from existing unsupervised feature selection algorithms, our algorithm selects the most discriminative feature subset from the whole feature set in batch mode. Extensive experiments on different data types demonstrate the effectiveness of our algorithm.
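For readers unfamiliar with the ℓ2,1-norm, the following minimal sketch in plain Python computes it as the sum of the Euclidean norms of the rows of a matrix. It also ranks features by row norm, on the assumption that each row of a weight matrix W corresponds to one feature; the matrix here is a hypothetical example, not a learned one, and the joint optimization described in the abstract is not shown.

```python
import math


def row_norms(W):
    """Euclidean norm of each row of W (one row per feature)."""
    return [math.sqrt(sum(v * v for v in row)) for row in W]


def l21_norm(W):
    """The l2,1-norm: the sum of the Euclidean norms of the rows."""
    return sum(row_norms(W))


def rank_features(W):
    """Feature indices ordered from largest to smallest row norm."""
    norms = row_norms(W)
    return sorted(range(len(W)), key=lambda i: -norms[i])


# Hypothetical weight matrix: three features, two output dimensions.
W = [[3.0, 4.0], [0.0, 0.0], [0.0, 1.0]]

print(l21_norm(W), rank_features(W))
```

Minimizing this norm tends to drive whole rows to zero, so a feature whose row vanishes (the second one here) contributes nothing and can be discarded.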
Yang, Y, Shen, HT, Nie, F, Ji, R & Zhou, X 2011, 'Nonnegative spectral clustering with discriminative regularization', Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, pp. 555-560.
Clustering is a fundamental research topic in the field of data mining. Optimizing the objective functions of clustering algorithms, e.g. normalized cut and k-means, is an NP-hard optimization problem. Existing algorithms usually relax the elements of cluster indicator matrix from discrete values to continuous ones. Eigenvalue decomposition is then performed to obtain a relaxed continuous solution, which must be discretized. The main problem is that the signs of the relaxed continuous solution are mixed. Such results may deviate severely from the true solution, making it a nontrivial task to get the cluster labels. To address the problem, we impose an explicit nonnegative constraint for a more accurate solution during the relaxation. Besides, we additionally introduce a discriminative regularization into the objective to avoid overfitting. A new iterative approach is proposed to optimize the objective. We show that the algorithm is a general one which naturally leads to other extensions. Experiments demonstrate the effectiveness of our algorithm. Copyright © 2011, Association for the Advancement of Artificial Intelligence. All rights reserved.
Yang, Y, Yang, Y, Huang, Z & Shen, HT 2011, 'Transfer tagging from image to video', MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, ACM International Conference on Multimedia, ACM, Scottsdale, Arizona, USA, pp. 1137-1140.
Massive amounts of web video data are now emerging on the Internet. To achieve effective and efficient video retrieval, it is critical to automatically assign semantic keywords to the videos via content analysis. However, most of the existing video tagging methods suffer from a lack of sufficient tagged training videos due to the high labor cost of manual tagging. Inspired by the observation that there are much more well-labeled data in other yet relevant types of media (e.g. images), in this paper we study how to build a "cross-media tunnel" to transfer external tag knowledge from image to video. Meanwhile, the intrinsic data structures of both image and video spaces are well explored for inferring tags. We propose a Cross-Media Tag Transfer (CMTT) paradigm which is able to: 1) transfer tag knowledge between image and video by minimizing their distribution difference; 2) infer tags by revealing the underlying manifold structures embedded within both image and video spaces. We also learn an explicit mapping function to handle unseen videos. Experimental results have been reported and analyzed to illustrate the superiority of our proposal. Copyright 2011 ACM.
Yang, Y, Yang, Y, Huang, Z, Shen, HT & Nie, F 2011, 'Tag localization with spatial correlations and joint group sparsity', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Colorado Springs, CO, USA, pp. 881-888.
Nowadays numerous social images have been emerging on the Web. How to precisely label these images is critical to image retrieval. However, traditional image-level tagging methods may become less effective because global image matching approaches can hardly cope with the diversity and arbitrariness of Web image content. This raises an urgent need for the fine-grained tagging schemes. In this work, we study how to establish mapping between tags and image regions, i.e. localize tags to image regions, so as to better depict and index the content of images. We propose the spatial group sparse coding (SGSC) by extending the robust encoding ability of group sparse coding with spatial correlations among training regions. We present spatial correlations in a two-dimensional image space and design group-specific spatial kernels to produce a more interpretable regularizer. Further we propose a joint version of the SGSC model which is able to simultaneously encode a group of intrinsically related regions within a test image. An effective algorithm is developed to optimize the objective function of the Joint SGSC. The tag localization task is conducted by propagating tags from sparsely selected groups of regions to the target regions according to the reconstruction coefficients. Extensive experiments on three public image datasets illustrate that our proposed models achieve great performance improvements over the state-of-the-art method in the tag localization task. © 2011 IEEE.
Yang, Y, Nie, F, Xiang, S, Zhuang, Y & Wang, W 2010, 'Local and Global Regressive Mapping for Manifold Learning with Out-of-Sample Extrapolation', PROCEEDINGS OF THE TWENTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-10), 24th AAAI Conference on Artificial Intelligence (AAAI), ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE, Atlanta, GA, pp. 649-654.
Yang, Y, Xu, Nie, Luo & Zhuang 2009, 'Ranking with Local Regression and Global Alignment for Cross Media Retrieval', ACM Multimedia.
Yang, Y, Zhuang, Y, Xu, D, Pan, Y, Tao, D & Maybank, S 2009, 'Retrieval Based Interactive Cartoon Synthesis via Unsupervised Bi-Distance Metric Learning', 2009 ACM International Conference on Multimedia Compilation E-Proceedings (with co-located workshops & symposiums), ACM international conference on Multimedia, Association for Computing Machinery, Inc. (ACM), Beijing, China, pp. 311-320.
Cartoons play important roles in many areas, but it requires a lot of labor to produce new cartoon clips. In this paper, we propose a gesture recognition method for cartoon character images with two applications, namely content-based cartoon image retrieval and cartoon clip synthesis. We first define Edge Features (EF) and Motion Direction Features (MDF) for cartoon character images. The features are classified into two different groups, namely intra-features and inter-features. An Unsupervised Bi-Distance Metric Learning (UBDML) algorithm is proposed to recognize the gestures of cartoon character images. Different from the previous research efforts on distance metric learning, UBDML learns the optimal distance metric from the heterogeneous distance metrics derived from intra-features and inter-features. Content-based cartoon character image retrieval and cartoon clip synthesis can be carried out based on the distance metric learned by UBDML. Experiments show that the cartoon character image retrieval has a high precision and that the cartoon clip synthesis can be carried out efficiently.
Yang, Y, Zhuang & Wang 2008, 'Heterogeneous Multimedia Data Semantics Mining using Content and Location Context', ACM Multimedia.
Zhuang, Y & Yang, Y 2007, 'Boosting cross-media retrieval by learning with positive and negative examples', ADVANCES IN MULTIMEDIA MODELING, PT 2, 13th International Multimedia Modeling Conference (MMM 2007), SPRINGER-VERLAG BERLIN, Singapore, SINGAPORE, pp. 165-+.
Wu, F, Yang, Y, Zhuang, YT & Pan, YH 2005, 'Understanding multimedia document semantics for cross-media retrieval', ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2005, PT 1, 6th Pacific-Rim Conference on Multimedia (PCM 2005), SPRINGER-VERLAG BERLIN, Cheju Isl, SOUTH KOREA, pp. 993-1004.
Person re-identification (re-ID) has become increasingly popular in the community due to its application and research significance. It aims at spotting a person of interest in other cameras. In the early days, hand-crafted algorithms and small-scale evaluation were predominantly reported. Recent years have witnessed the emergence of large-scale datasets and deep learning systems which make use of large data volumes. Considering different tasks, we classify most current re-ID methods into two classes, i.e., image-based and video-based; in both tasks, hand-crafted and deep learning systems will be reviewed. Moreover, two new re-ID tasks which are much closer to real-world applications are described and discussed, i.e., end-to-end re-ID and fast re-ID in very large galleries. This paper: 1) introduces the history of person re-ID and its relationship with image classification and instance retrieval; 2) surveys a broad selection of the hand-crafted systems and the large-scale methods in both image- and video-based re-ID; 3) describes critical future directions in end-to-end re-ID and fast retrieval in large galleries; and 4) finally briefs some important yet under-developed issues.
Clustering is an effective technique in data mining to generate groups that are the matter of interest. Among various clustering approaches, the families of k-means algorithms and min-cut algorithms have gained the most popularity due to their simplicity and efficacy. The classical k-means algorithm partitions a number of data points into several subsets by iteratively updating the clustering centers and the associated data points. By contrast, min-cut algorithms construct a weighted undirected graph and partition its vertices into two sets. However, existing clustering algorithms tend to assign a small minority of data points to a subset, which should be avoided when the target dataset is balanced. To achieve more accurate clustering for balanced datasets, we propose to leverage the exclusive lasso on k-means and min-cut to regulate the balance degree of the clustering results. By optimizing our objective functions, which are built atop the exclusive lasso, we can make the clustering result as balanced as possible. Extensive experiments on several large-scale datasets validate the advantage of the proposed algorithms compared to the state-of-the-art clustering algorithms.
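The balancing effect of an exclusive-lasso style penalty can be sketched in a few lines of plain Python. The sketch assumes the penalty is applied to a 0/1 cluster indicator matrix, in which case it reduces to the sum of squared cluster sizes; the function name and label lists are hypothetical, and the abstract's actual objective functions for k-means and min-cut are not reproduced.

```python
from collections import Counter


def balance_penalty(labels):
    """Sum of squared cluster sizes for a list of cluster assignments.

    Assumption: this is the value an exclusive-lasso penalty takes on a
    0/1 cluster indicator matrix (one column per cluster).
    """
    sizes = Counter(labels)
    return sum(n * n for n in sizes.values())


# Two hypothetical assignments of six points to three clusters.
balanced = [0, 0, 1, 1, 2, 2]
skewed = [0, 0, 0, 0, 1, 2]

print(balance_penalty(balanced), balance_penalty(skewed))
```

For a fixed number of points, the sum of squared sizes is smallest when all clusters have equal size, so adding such a term to a clustering objective penalizes assignments that leave a few points isolated in their own cluster.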
Spectral clustering is a key research topic in the field of machine learning and data mining. Most of the existing spectral clustering algorithms are built upon Gaussian Laplacian matrices, which are sensitive to parameters. We propose a novel parameter-free, distance-consistent Locally Linear Embedding (LLE). The proposed distance-consistent LLE promises that edges between closer data points have greater weight. Furthermore, we propose a novel improved spectral clustering via embedded label propagation. Our algorithm is built upon two advancements of the state of the art: 1) label propagation, which propagates a node's labels to neighboring nodes according to their proximity; and 2) manifold learning, which has been widely used in its capacity to leverage the manifold structure of data points. First we perform standard spectral clustering on original data and assign each cluster to k nearest data points. Next, we propagate labels through dense, unlabeled data regions. Extensive experiments with various datasets validate the superiority of the proposed algorithm compared to current state-of-the-art spectral algorithms.