Yi Yang is a professor with the Faculty of Engineering and Information Technology, University of Technology Sydney (UTS). He received the PhD degree in Computer Science from Zhejiang University in 2010. He was a postdoctoral researcher at the School of Computer Science, Carnegie Mellon University, before moving to Australia. See more information about our lab at http://reler.net/.
Can supervise: YES
Ding, Y, Fan, H, Xu, M & Yang, Y 2020, 'Adaptive Exploration for Unsupervised Person Re-identification', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 16, no. 1.
© 2020 ACM. Due to domain bias, directly deploying a deep person re-identification (re-ID) model trained on one dataset often yields considerably lower accuracy on another dataset. In this article, we propose an Adaptive Exploration (AE) method to address the domain-shift problem for re-ID in an unsupervised manner. Specifically, in the target domain, the re-ID model is induced to (1) maximize distances between all person images and (2) minimize distances between similar person images. In the first case, by treating each person image as an individual class, a non-parametric classifier with a feature memory is exploited to encourage person images to move far away from each other. In the second case, according to a similarity threshold, our method adaptively selects neighbors for each person image in the feature space. By treating these similar person images as the same class, the non-parametric classifier forces them to stay close. However, a problem with the adaptive selection is that, when an image has too many neighbors, it is more likely to attract other images as its neighbors. As a result, a minority of images may select a large number of neighbors while the majority of images have only a few. To address this issue, we additionally integrate a balance strategy into the adaptive selection. We evaluate our method with two protocols. The first, "target-only re-ID", uses only the unlabeled target data for training. The second, "domain adaptive re-ID", uses both the source data and the target data during training. Experimental results on large-scale re-ID datasets demonstrate the effectiveness of our method. Our code has been released at https://github.com/dyh127/Adaptive-Exploration-for-Unsupervised-Person-….
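A minimal sketch of the two AE objectives described above, assuming a batch of pre-extracted features and a per-image feature memory; the threshold, temperature, and omitted memory update and balance strategy are illustrative assumptions, not the released implementation linked above.

```python
import torch
import torch.nn.functional as F

def adaptive_exploration_loss(features, memory, threshold=0.6, temp=0.05):
    """Sketch of the two AE objectives: a non-parametric classifier over a
    per-image feature memory pushes all images apart, while each image is
    pulled toward the memory entries selected as its neighbors."""
    features = F.normalize(features, dim=1)        # (B, D) batch features
    memory = F.normalize(memory, dim=1)            # (N, D) one slot per image
    probs = F.softmax(features @ memory.t() / temp, dim=1)   # (B, N)

    loss = 0.0
    for i in range(features.size(0)):
        # adaptive selection: neighbors are memory slots above the threshold
        neighbors = (features[i] @ memory.t() > threshold).nonzero().flatten()
        # treating the image and its neighbors as one class concentrates
        # probability mass on them and pushes all other images away
        loss = loss - torch.log(probs[i, neighbors].sum() + 1e-8)
    return loss / features.size(0)
```

In such memory-based schemes the memory is typically refreshed with a running average of each image's latest feature after every step; the balance strategy that caps overly popular neighbors is likewise omitted here.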
Feng, Q, Wu, Y, Fan, H, Yan, C, Xu, M & Yang, Y 2020, 'Cascaded Revision Network for Novel Object Captioning', IEEE Transactions on Circuits and Systems for Video Technology, pp. 1-1.
Guan, Q, Huang, Y, Zhong, Z, Zheng, Z, Zheng, L & Yang, Y 2020, 'Thorax disease classification with attention guided convolutional neural network', Pattern Recognition Letters, vol. 131, pp. 38-45.
© 2019 Elsevier B.V. This paper considers the task of thorax disease diagnosis on chest X-ray (CXR) images. Most existing methods learn a network with global images as input. However, thorax diseases usually occur in small, disease-specific localized areas, so training CNNs on global images may be affected by excessive irrelevant noisy regions. Besides, due to the poor alignment of some CXR images, the existence of irregular borders hinders network performance. To address these problems, we propose to integrate global and local cues into a three-branch attention guided convolutional neural network (AG-CNN) to identify thorax diseases. A cropping strategy based on attention-guided mask inference is proposed to avoid noise and improve alignment, and the global branch compensates for the discriminative cues lost by the local branch. Specifically, we first learn a global CNN branch using global images. Then, guided by the attention heatmap generated from the global branch, we infer a mask to crop a discriminative region from the global image. The local region is used to train a local CNN branch. Lastly, we concatenate the last pooling layers of both the global and local branches to fine-tune the fusion branch. Experiments on the ChestX-ray14 dataset demonstrate that integrating the local cues with the global information improves the average AUC scores achieved by AG-CNN.
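A minimal sketch of the attention-guided cropping step, assuming the global branch's heatmap is given and its size divides the image size; the threshold `tau` and the nearest-neighbor upsampling are simplifications of the paper's mask inference.

```python
import numpy as np
from scipy import ndimage

def attention_crop(image, heatmap, tau=0.7):
    """Sketch of attention-guided cropping: binarize the global branch's
    heatmap, keep the largest connected region, and crop its bounding box
    from the input image to feed the local branch."""
    h_img, w_img = image.shape[:2]
    # nearest-neighbor upsampling; assumes heatmap size divides image size
    hm = np.kron(heatmap, np.ones((h_img // heatmap.shape[0],
                                   w_img // heatmap.shape[1])))
    mask = hm >= tau * hm.max()
    labeled, n_regions = ndimage.label(mask)
    if n_regions == 0:
        return image                      # fall back to the global image
    sizes = ndimage.sum(mask, labeled, range(1, n_regions + 1))
    ys, xs = np.where(labeled == (np.argmax(sizes) + 1))
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```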
He, Y, Dong, X, Kang, G, Fu, Y, Yan, C & Yang, Y 2020, 'Asymptotic Soft Filter Pruning for Deep Convolutional Neural Networks', IEEE Transactions on Cybernetics, vol. 50, no. 8, pp. 3594-3604.
Deeper and wider convolutional neural networks (CNNs) achieve superior performance but incur expensive computation costs. Accelerating such overparameterized neural networks has therefore received increasing attention. A typical pruning algorithm is a three-stage pipeline: training, pruning, and retraining. Prevailing approaches fix the pruned filters to zero during retraining and thus significantly reduce the optimization space. Besides, they prune a large number of filters at the outset, which can cause unrecoverable information loss. To solve these problems, we propose an asymptotic soft filter pruning (ASFP) method to accelerate the inference of deep neural networks. First, we update the pruned filters during the retraining stage. As a result, the optimization space of the pruned model is not reduced but remains the same as that of the original model, so the model has enough capacity to learn from the training data. Second, we prune the network asymptotically: few filters are pruned at first, and more are pruned progressively as training proceeds. With asymptotic pruning, the information of the training set is gradually concentrated in the remaining filters, so the subsequent training and pruning process is stable. Experiments show the effectiveness of ASFP on image classification benchmarks. Notably, on ILSVRC-2012, ASFP reduces more than 40% of the FLOPs of ResNet-50 with only 0.14% top-5 accuracy degradation, an 8% improvement over soft filter pruning.
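A minimal sketch of one soft, asymptotic pruning step, assuming a simple linear pruning-rate schedule in place of the paper's exact schedule; `conv_weights` is a hypothetical list of convolutional weight tensors.

```python
import torch

def asfp_step(conv_weights, epoch, max_epochs, final_rate=0.4):
    """Sketch of one asymptotic soft pruning step: zero the filters with the
    smallest L2 norms at a rate that grows toward `final_rate`. Zeroed
    filters keep receiving gradients, so they may recover later."""
    rate = final_rate * min(1.0, epoch / max_epochs)   # assumed linear schedule
    for w in conv_weights:                             # w: (out_c, in_c, k, k)
        n_prune = int(rate * w.size(0))
        if n_prune == 0:
            continue
        norms = w.view(w.size(0), -1).norm(p=2, dim=1)
        idx = torch.argsort(norms)[:n_prune]           # smallest-norm filters
        with torch.no_grad():
            w[idx] = 0.0                               # soft: zeroed, not removed
    return rate
```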
Lin, Y, Wu, Y, Yan, C, Xu, M & Yang, Y 2020, 'Unsupervised person re-identification via cross-camera similarity exploration', IEEE Transactions on Image Processing, vol. 29, pp. 5481-5490.
© 1992-2012 IEEE. Most person re-identification (re-ID) approaches are based on supervised learning, which requires manually annotated data. However, acquiring identity annotation is not only resource-intensive but also impractical for large-scale data. To alleviate this problem, we propose a cross-camera unsupervised approach that makes use of unsupervised style-transferred images to jointly optimize a convolutional neural network (CNN) and the relationships among individual samples for person re-ID. Our algorithm considers two fundamental facts of the re-ID task: variance across diverse cameras and similarity within the same identity. In this paper, we propose an iterative framework that overcomes camera variance and achieves cross-camera similarity exploration. Specifically, we apply an unsupervised style transfer model to generate style-transferred training images with different camera styles. Then we iteratively exploit the similarity within the same identity from both the original and the style-transferred data. We start by considering each training image as a different class to initialize the CNN model. Then we measure the similarity and gradually group similar samples into one class, which increases similarity within each identity. We also introduce a diversity regularization term in the clustering to balance the cluster distribution. The experimental results demonstrate that our algorithm is not only superior to state-of-the-art unsupervised re-ID approaches, but also performs favorably compared with competing unsupervised domain adaptation (UDA) and semi-supervised learning methods.
Liu, W, Chang, X, Chen, L, Phung, D, Zhang, X, Yang, Y & Hauptmann, AG 2020, 'Pair-based uncertainty and diversity promoting early active learning for person re-identification', ACM Transactions on Intelligent Systems and Technology, vol. 11, no. 2.
© 2020 Association for Computing Machinery. The effective training of supervised person re-identification (Re-ID) models requires sufficient pairwise labeled data. However, when annotation resources are limited, it is difficult to collect pairwise labeled data. We consider a challenging and practical problem called early active learning, which applies to the early stage of experiments, when no pre-labeled samples are available as references for human annotators. Previous early active learning methods suffer from two limitations for Re-ID. First, these instance-based algorithms select instances rather than pairs, which can result in missing optimal pairs for Re-ID. Second, most of these methods consider only the representativeness of instances, which can result in selecting less diverse and less informative pairs. To overcome these limitations, we propose a novel pair-based active learning method for Re-ID. Our algorithm selects pairs instead of instances from the entire dataset for annotation. Besides representativeness, we further take into account the uncertainty and the diversity of pairwise relations. Therefore, our algorithm can produce the most representative, informative, and diverse pairs for Re-ID data annotation. Extensive experimental results on five benchmark Re-ID datasets demonstrate the superiority of the proposed pair-based early active learning algorithm.
Liu, W, Gong, D, Tan, M, Shi, JQ, Yang, Y & Hauptmann, AG 2020, 'Learning Distilled Graph for Large-Scale Social Network Data Clustering', IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 7, pp. 1393-1404.
Ma, F, Meng, D, Dong, X & Yang, Y 2020, 'Self-paced multi-view co-training', Journal of Machine Learning Research, vol. 21.
© 2020 Fan Ma, Deyu Meng, Xuanyi Dong and Yi Yang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-794.html. Co-training is a well-known semi-supervised learning approach that trains classifiers on two or more different views and exchanges pseudo labels of unlabeled instances in an iterative way. During the co-training process, pseudo labels of unlabeled instances are very likely to be false, especially in the initial training rounds, yet the standard co-training algorithm adopts a "draw without replacement" strategy and never removes wrongly labeled instances from later training stages. Besides, most traditional co-training approaches are implemented for two-view cases, and their extensions to multi-view scenarios are not intuitive. These issues not only degrade their performance and restrict their range of application but also hamper their theoretical analysis. Moreover, there is no optimization model that explains the objective a co-training process actually optimizes. To address these issues, in this study we design a unified self-paced multi-view co-training (SPamCo) framework that draws unlabeled instances with replacement. Two specified co-regularization terms are formulated to develop different strategies for selecting pseudo-labeled instances during training. Both forms share the same optimization strategy, which is consistent with the iteration process in co-training and can be naturally extended to multi-view scenarios. A distributed optimization strategy is also introduced to train the classifier of each view in parallel and further improve the efficiency of the algorithm. Furthermore, the SPamCo algorithm is proved to be PAC learnable, supporting its theoretical soundness. Experiments conducted on synthetic, text categorization, person re-identification, image recognition, and object detection data sets substantiate the superiority of the proposed method.
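A minimal sketch of the "draw with replacement" idea with scikit-learn classifiers: the pseudo-labeled pool is rebuilt from scratch each round under a growing self-paced budget. The view setup, round count, and plain confidence-based selection are illustrative assumptions, not the SPamCo co-regularization terms.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def spamco_sketch(Xl_views, yl, Xu_views, rounds=5, add_per_round=20):
    """Self-paced co-training sketch with draw-with-replacement: the
    pseudo-labeled pool is re-selected every round (so early mistakes can
    be dropped) and its budget grows round by round."""
    n_views = len(Xl_views)
    models = [LogisticRegression(max_iter=1000).fit(Xl_views[v], yl)
              for v in range(n_views)]
    for r in range(1, rounds + 1):
        new_models = []
        for v in range(n_views):
            # confidence for view v comes from the *other* views (co-training)
            proba = np.mean([models[u].predict_proba(Xu_views[u])
                             for u in range(n_views) if u != v], axis=0)
            conf, labels = proba.max(axis=1), proba.argmax(axis=1)
            top = np.argsort(-conf)[: r * add_per_round]   # self-paced budget
            X_aug = np.vstack([Xl_views[v], Xu_views[v][top]])
            y_aug = np.concatenate([yl, labels[top]])
            new_models.append(LogisticRegression(max_iter=1000).fit(X_aug, y_aug))
        models = new_models   # replacement: selection is redone next round
    return models
```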
Sun, Y, Zheng, L, Li, Y, Yang, Y, Tian, Q & Wang, S 2020, 'Learning Part-based Convolutional Features for Person Re-identification', IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1.
Wu, A, Han, Y, Yang, Y, Hu, Q & Wu, F 2020, 'Convolutional Reconstruction-to-Sequence for Video Captioning', IEEE Transactions on Circuits and Systems for Video Technology, pp. 1-1.
Zhang, L, Luo, M, Liu, J, Chang, X, Yang, Y & Hauptmann, AG 2020, 'Deep Top-k Ranking for Image-Sentence Matching', IEEE Transactions on Multimedia, vol. 22, no. 3, pp. 775-785.
© 1999-2012 IEEE. Image-sentence matching is a challenging task due to the heterogeneity gap between the two modalities. Ranking-based methods have achieved excellent performance on this task over the past decades. Given an image query, these methods typically assume that the correctly matched image-sentence pair must rank above all mismatched ones. However, this assumption may be too strict and prone to overfitting, especially when some sentences in a massive database are similar and confusable with one another. In this paper, we relax the traditional ranking loss and propose a novel deep multi-modal network with a top-k ranking loss to mitigate the data-ambiguity problem. With this strategy, query results are not penalized unless the index of the ground truth falls outside the top-k query results. Considering the non-smoothness and non-convexity of the initial top-k ranking loss, we exploit a tight convex upper bound to approximate the loss and then utilize the traditional back-propagation algorithm to optimize the deep multi-modal network. Finally, we apply the method on three benchmark datasets, namely Flickr8k, Flickr30k, and MSCOCO. Empirical results on the R@K metrics (K = 1, 5, 10) show that our method achieves performance comparable to state-of-the-art methods.
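A minimal sketch of a top-k-style ranking surrogate in PyTorch: the query is penalized only through its k largest hinge violations, so it escapes punishment once the ground truth sits comfortably inside the top-k. This is one common convex surrogate, not necessarily the paper's exact upper bound.

```python
import torch

def topk_ranking_loss(scores, gt_index, k=5, margin=0.2):
    """Average of the k largest margin violations against the ground-truth
    candidate; convex in the scores and zero once the ground truth beats
    all but the k hardest mismatches by the margin."""
    violations = (margin + scores - scores[gt_index]).clamp(min=0)
    mask = torch.ones_like(scores, dtype=torch.bool)
    mask[gt_index] = False                      # ignore the ground truth itself
    return torch.topk(violations[mask], k).values.mean()  # k <= len(scores)-1
```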
Zheng, Z, Zheng, L, Garrett, M, Yang, Y, Xu, M & Shen, YD 2020, 'Dual-path Convolutional Image-Text Embeddings with Instance Loss', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 16, no. 2.
© 2020 ACM. Matching images and sentences demands a fine understanding of both modalities. In this article, we propose a new system to discriminatively embed images and text into a shared visual-textual space. In this field, most existing works apply a ranking loss to pull positive image/text pairs close and push negative pairs apart. However, directly deploying the ranking loss on heterogeneous features (i.e., text and image features) is less effective, because it is hard to find appropriate triplets at the beginning, so naive use of the ranking loss may keep the network from learning the inter-modal relationship. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on the unsupervised assumption that each image/text group can be viewed as a class, so the network can learn fine granularity from every image/text group. Experiments show that the instance loss offers better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. Besides, existing works usually apply off-the-shelf features, i.e., word2vec and fixed visual features. As a minor contribution, this article therefore constructs an end-to-end dual-path convolutional network to learn the image and text representations. End-to-end learning allows the system to learn directly from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared with state-of-the-art methods. Moreover, in language-based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
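A minimal sketch of the instance-loss idea: every image/text pair gets its own class, and a single classifier shared by both modalities ties the two embeddings to the same space. The module and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class InstanceLoss(nn.Module):
    """Each image/text pair is treated as its own class; the shared
    classifier forces both modalities into one visual-textual space."""
    def __init__(self, embed_dim, num_instances):
        super().__init__()
        self.shared_classifier = nn.Linear(embed_dim, num_instances)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, img_emb, txt_emb, instance_ids):
        # the same instance id supervises both the image and the text path
        return (self.ce(self.shared_classifier(img_emb), instance_ids) +
                self.ce(self.shared_classifier(txt_emb), instance_ids))
```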
Zhong, Z, Zheng, L, Luo, Z, Li, S & Yang, Y 2020, 'Learning to Adapt Invariance in Memory for Person Re-identification', IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1.
Zhou, R, Chang, X, Shi, L, Shen, YD, Yang, Y & Nie, F 2020, 'Person Reidentification via Multi-Feature Fusion with Adaptive Graph Learning', IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 5, pp. 1592-1601.
© 2012 IEEE. The goal of person reidentification (Re-ID) is to identify a given pedestrian across a network of nonoverlapping surveillance cameras. Most existing works follow the supervised learning paradigm, which requires pairwise labeled training data for each pair of cameras. However, this limits their scalability to real-world applications, where abundant unlabeled data are available. To address this issue, we propose a multi-feature fusion model with adaptive graph learning for unsupervised Re-ID. Our model seeks a consistent graph structure over pedestrians by exploiting the complementary information carried by multiple feature descriptors. Specifically, we incorporate multi-feature dictionary learning and adaptive multi-feature graph learning into a unified learning model such that the learned dictionaries are discriminative and the subsequent graph structure learning is accurate. An alternating optimization algorithm with proven convergence is developed to solve the final optimization objective. Extensive experiments on four benchmark data sets demonstrate the superiority and effectiveness of the proposed method.
Yan, Y, Tan, M, Tsang, I, Yang, Y, Shi, Q & Zhang, C 2020, 'Fast and Low Memory Cost Matrix Factorization: Algorithm, Analysis and Case Study', IEEE Transactions on Knowledge and Data Engineering.
Matrix factorization has been widely applied to various applications. With the fast development of storage and internet technologies, we have been witnessing a rapid increase of data. In this paper, we propose new algorithms for matrix factorization with an emphasis on efficiency. In addition, most existing matrix factorization methods consider only a general smooth least-squares loss, yet many real-world applications have distinctive characteristics, and different losses should be used accordingly. It is therefore beneficial to design new matrix factorization algorithms able to deal with both smooth and non-smooth losses. To this end, one needs to analyze the characteristics of the target data and use the most appropriate loss based on that analysis. We particularly study two representative cases of low-rank matrix recovery: collaborative filtering for recommendation and high dynamic range imaging. To solve these two problems, we propose for each a stage-wise matrix factorization algorithm that exploits manifold optimization techniques. From our theoretical analysis, both are provably guaranteed to converge to a stationary point. Extensive experiments on recommender systems and high dynamic range imaging demonstrate the satisfactory performance and efficiency of our proposed methods on large-scale real data.
Du, X, Yin, H, Chen, L, Wang, Y, Yang, Y & Zhou, X 2020, 'Personalized Video Recommendation Using Rich Contents from Videos', IEEE Transactions on Knowledge and Data Engineering.
Video recommendation has become an essential way of helping people explore the massive volume of videos on the web and discover those that may be of interest to them. In existing video recommender systems, the models make recommendations based on user-video interactions and single specific content features; when those specific content features are unavailable, the performance of the existing models deteriorates seriously. Inspired by the fact that rich contents (e.g., text, audio, motion, and so on) exist in videos, in this paper we explore how to use these rich contents to overcome the limitations caused by the unavailability of specific features. Specifically, we propose a novel general framework, named the collaborative embedding regression (CER) model, that incorporates any single content feature with user-video interactions to make effective video recommendations in both in-matrix and out-of-matrix scenarios. Our extensive experiments on two real-world large-scale datasets show that CER beats the existing recommender models with any single content feature and is more time-efficient. In addition, we propose a priority-based late fusion (PRI) method to gain the benefit brought by integrating multiple content features. The corresponding experiments show that PRI brings real performance improvement over the baseline and outperforms existing fusion methods.
Dong, X, Yan, Y, Tan, M, Yang, Y & Tsang, WHI 2019, 'Late Fusion via Subspace Search With Consistency Preservation', IEEE Transactions on Image Processing, vol. 28, no. 1, pp. 518-528.
In many real-world applications, data can be represented in multiple ways, with multi-view features describing various characteristics of the data. In this sense, prediction performance can be significantly improved by taking advantage of these features together. Late fusion, which combines the predictions of multiple features, is a commonly used approach to make the final decision for a test instance. However, it is common for different features to dispute the prediction on the same data, leading to performance degeneration. In this paper, we propose an efficient and effective matrix factorization-based approach to fuse predictions from multiple sources. This approach leverages a hard constraint on the matrix rank to preserve the consistency of predictions by various features, and we thus name it Hard-rank Constraint Matrix Factorization-based fusion (HCMF). HCMF avoids the performance degeneration caused by the disagreement of multiple features. Extensive experiments demonstrate the efficacy of HCMF for outlier detection and show performance improvements over state-of-the-art late fusion algorithms on many data sets.
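A minimal sketch of rank-constrained fusion, assuming each feature contributes a score vector over the same test instances: the stacked score matrix is projected onto the set of low-rank matrices via truncated SVD, which reconciles disagreeing columns before averaging. This illustrates the hard-rank idea, not the exact HCMF optimization.

```python
import numpy as np

def hard_rank_fusion(prediction_list, rank=1):
    """Stack each feature's score vector as a column, project the matrix
    onto rank-`rank` matrices (truncated SVD gives the closest one), and
    average the reconciled columns into a consensus prediction."""
    P = np.column_stack(prediction_list)          # (n_samples, n_features)
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    P_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # closest low-rank matrix
    return P_low.mean(axis=1)                     # consensus prediction
```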
Dong, X, Zheng, L, Ma, F, Yang, Y & Meng, D 2019, 'Few-Example Object Detection with Model Communication', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1641-1654.
In this paper, we study object detection using a large pool of unlabeled images and only a few labeled images per category, a setting we call "few-example object detection". The key challenge is to generate as many trustworthy training samples as possible from the pool. Using the few training examples as seeds, our method iterates between model training and high-confidence sample selection. In training, easy samples are generated first, and the poorly initialized model gradually improves. As the model becomes more discriminative, challenging but reliable samples are selected, and another round of model improvement takes place. To further improve the precision and recall of the generated training samples, we embed multiple detection models in our framework, which proves to outperform both the single-model baseline and the model ensemble method. Experiments on PASCAL VOC'07, MS COCO'14, and ILSVRC'13 indicate that with as few as three or four samples selected per category, our method produces very competitive results compared with state-of-the-art weakly supervised approaches that use a large number of image-level labels.
Du, X, Nie, F, Wang, W, Yang, Y & Zhou, X 2019, 'Exploiting Combination Effect for Unsupervised Feature Selection by l2,0 Norm', IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 1, pp. 201-214.
In learning applications, exploring the cluster structures of high-dimensional data is an important task. It requires projecting or visualizing the cluster structures in a low-dimensional space. The challenges are: 1) how to perform the projection or visualization with less information loss and 2) how to preserve the interpretability of the original data. Recent methods address these challenges simultaneously via unsupervised feature selection: they learn cluster indicators based on the k-nearest-neighbor similarity graph, then select the features highly correlated with these indicators. Under this direction, many techniques, such as local discriminative analysis, nonnegative spectral analysis, and nonnegative matrix factorization, have been successfully introduced to make the selection more accurate. In this paper, we focus on enhancing unsupervised feature selection from another perspective, namely, making the selection exploit the combination effect of the features. Given the expected number of features, previous works operate on all features and then select those with high coefficients one by one as the output. Our proposed method instead operates on a group of features initially and then updates the selection when a better group appears. Compared with previous methods, the proposed method exploits the combination effect of the features via the l2,0 norm. It improves the selection accuracy when the cluster structures are strongly related to a group of features. We conduct experiments on six open-access data sets from different domains. The experimental results show that our proposed method is more accurate than recent methods that do not specifically consider the combination effect of the features.
Lin, Y, Zheng, L, Zheng, Z, Wu, Y, Hu, Z, Yan, C & Yang, Y 2019, 'Improving person re-identification by attribute and identity learning', Pattern Recognition, vol. 95, pp. 151-161.
© 2019 Elsevier Ltd. Person re-identification (re-ID) and attribute recognition share a common target of learning pedestrian descriptions; they differ in granularity. Most existing re-ID methods take only the identity labels of pedestrians into consideration. However, we find that attributes, which contain detailed local descriptions, help the re-ID model learn more discriminative feature representations. In this paper, based on the complementarity of attribute labels and ID labels, we propose the attribute-person recognition (APR) network, a multi-task network that learns a re-ID embedding and at the same time predicts pedestrian attributes. We manually annotate attribute labels for two large-scale re-ID datasets and systematically investigate how person re-ID and attribute recognition benefit from each other. In addition, we re-weight the attribute predictions considering the dependencies and correlations among the attributes. The experimental results on two large-scale re-ID benchmarks demonstrate that, by learning a more discriminative representation, APR achieves competitive re-ID performance compared with state-of-the-art methods. We use APR to speed up the retrieval process by ten times with a minor accuracy drop of 2.92% on Market-1501. Besides, we also apply APR to the attribute recognition task and demonstrate improvement over the baselines.
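A minimal sketch of a multi-task head in the spirit of APR, assuming a backbone that maps images to a feature vector: one identity classifier plus one classifier per attribute, trained with a summed cross-entropy loss. Names, the loss weighting, and the attribute re-weighting step are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class APRHead(nn.Module):
    """Shared backbone feature feeding one identity classifier and one
    classifier per pedestrian attribute."""
    def __init__(self, backbone, feat_dim, num_ids, attr_class_counts):
        super().__init__()
        self.backbone = backbone                   # any CNN -> (B, feat_dim)
        self.id_head = nn.Linear(feat_dim, num_ids)
        self.attr_heads = nn.ModuleList(
            nn.Linear(feat_dim, c) for c in attr_class_counts)

    def forward(self, x):
        f = self.backbone(x)
        return self.id_head(f), [head(f) for head in self.attr_heads]

def apr_loss(id_logits, attr_logits, id_labels, attr_labels, lam=1.0):
    # identity loss plus the averaged attribute losses, weighted by lam
    loss = F.cross_entropy(id_logits, id_labels)
    for logits, labels in zip(attr_logits, attr_labels):
        loss = loss + lam * F.cross_entropy(logits, labels) / len(attr_logits)
    return loss
```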
© 2018 Elsevier B.V. Person re-identification (re-ID) is challenging because pedestrians may exhibit distinct appearances under different cameras. Given a query image, previous methods usually output the person retrieval results directly, which may perform badly due to the limited information provided by the single query image. To mine more query information, we add an expansion step to post-process the initial ranking list. The intuition is that a true match in the gallery may be difficult to find with the query alone, but it can easily be retrieved by other true matches in the initial ranking list. In this paper, we propose the Bayesian Query Expansion (BQE) method, which generates a new query from information in the initial ranking list. A Bayesian model is used to predict true matches in the gallery. We pool the features of these "true matches" into a single vector, i.e., the expanded new query, with which the retrieval process is performed again to obtain the final results. We evaluate BQE with various feature extraction methods and distance metric learning methods on four large-scale re-ID datasets. We observe consistent improvement over all the baselines and report competitive performance compared with state-of-the-art results.
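A minimal sketch of the expansion step, with a plain top-k selection standing in for the Bayesian true-match model: the selected gallery features are average-pooled with the query, and retrieval runs again with the new vector.

```python
import numpy as np

def expanded_query_search(query_feat, gallery_feats, select=5):
    """Rank the gallery, pool the presumed true matches with the query,
    and re-run retrieval with the expanded query vector."""
    sims = gallery_feats @ query_feat          # cosine if rows are normalized
    top = np.argsort(-sims)[:select]           # stand-in for Bayesian selection
    new_query = np.vstack([query_feat, gallery_feats[top]]).mean(axis=0)
    new_query /= np.linalg.norm(new_query) + 1e-12
    return np.argsort(-(gallery_feats @ new_query))   # final ranking
```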
Liu, R, Zhao, Y, Wei, S, Zheng, L & Yang, Y 2019, 'Modality-invariant image-text embedding for image-sentence matching', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 15, no. 1.
© 2019 Association for Computing Machinery. Performing direct matching between different modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to propose a uniform model for matching images with all types of texts, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneity gap between visual and textual information, it provides only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which provides heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull their distributions close by employing adversarial learning. However, the effectiveness of adversarial learning on image-sentence matching has not been proven, and an effective method is still lacking. Inspired by previous works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by involving adversarial learning. On top of the triplet-loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding into either the image or the text modality. In addition, the multi-stage training procedure is carefully designed so that the proposed network not only imposes the image-text similarity constraints with ground-truth labels, but also enforces the image and text embedding distributions to be similar via adversarial learning. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvement over the baseline model and that our ...
Wu, Y, Lin, Y, Dong, X, Yan, Y, Bian, W & Yang, Y 2019, 'Progressive Learning for Person Re-Identification with One Example', IEEE Transactions on Image Processing, vol. 28, no. 6, pp. 2872-2881.
In this paper, we focus on the one-example person re-identification (re-ID) task, where each identity has only one labeled example along with many unlabeled examples. We propose a progressive framework that gradually exploits the unlabeled data for person re-ID. In this framework, we iteratively (1) update the convolutional neural network (CNN) model and (2) estimate pseudo labels for the unlabeled data. We split the training data into three parts: labeled data, pseudo-labeled data, and index-labeled data. Initially, the re-ID model is trained on the labeled data. For the subsequent model training, we update the CNN model by joint training on the three data parts. The proposed joint training method can optimize the model with both the data with (pseudo) labels and the data without any reliable labels. For the label estimation step, instead of using a static sampling strategy, we propose a progressive sampling strategy that increases the number of selected pseudo-labeled candidates step by step. We select a few candidates with the most reliable pseudo labels from the unlabeled examples as the pseudo-labeled data and keep the rest as index-labeled data by assigning them their data indexes. During the iterations, index-labeled data are dynamically transferred to the pseudo-labeled set. Notably, the rank-1 accuracy of our method outperforms the state-of-the-art method by 21.6 points (absolute, i.e., 62.8% vs. 41.2%) on MARS and by 16.6 points on DukeMTMC-VideoReID. Extended to the few-example setting, our approach with only 20% labeled data achieves performance comparable to the supervised state-of-the-art method trained with 100% labeled data.
Yao, Y, Wang, L, Zhang, L, Yang, Y, Li, P, Zimmermann, R & Shao, L 2019, 'Learning Latent Stable Patterns for Image Understanding With Weak and Noisy Labels', IEEE Transactions on Cybernetics, vol. 49, no. 12, pp. 4243-4252.
This paper focuses on weakly supervised image understanding, in which semantic labels are available only at the image level, without the specific object or scene locations in an image. Existing algorithms implicitly assume that image-level labels are error-free, which might be too restrictive; in practice, image labels obtained from pretrained predictors are easily contaminated. To solve this problem, we propose a novel algorithm for weakly supervised segmentation when only noisy image labels are available during training. More specifically, a semantic space is constructed first by encoding image labels through a graphlet (i.e., superpixel cluster) embedding process. We then observe that, in the semantic space, the distribution of graphlets from images with the same label remains stable regardless of the noise in the image labels. Therefore, we propose a generative model, called latent stability analysis, to discover stable patterns from images with noisy labels. Inferring graphlet semantics from these mid-level stable patterns is much more robust and accurate than directly transferring noisy image-level labels onto different regions. Finally, we calculate the semantics of each superpixel using maximum majority voting over its correlated graphlets. Comprehensive experimental results show that our algorithm performs impressively when the image labels are predicted by either hand-crafted or deeply learned image descriptors.
Zhan, K, Chang, X, Guan, J, Chen, L, Ma, Z & Yang, Y 2019, 'Adaptive Structure Discovery for Multimedia Analysis Using Multiple Features', IEEE Transactions on Cybernetics, vol. 49, no. 5, pp. 1826-1834.
Multifeature learning is a fundamental research problem in multimedia analysis. Most existing multifeature learning methods take a graph, which must be computed beforehand, as input to uncover the data distribution. These methods face two major problems. First, graph construction calculates the similarity of nearby data pairs with a fixed function, e.g., the RBF kernel, but the intrinsic correlation among different data pairs varies constantly, so feature learning based on such predefined graphs may degrade, especially when there is dramatic correlation variation between nearby data pairs. Second, in most existing algorithms, each single-feature graph is computed independently and the graphs are then combined for learning, which ignores the correlation between multiple features. In this paper, a new unsupervised multifeature learning method is proposed to make the best use of the correlation among different features by jointly optimizing data correlation from multiple features in an adaptive way. Instead of computing the affinity weight of data pairs with a fixed function, the weight of the affinity graph is learned through a well-designed optimization problem. Additionally, the affinity graphs of data pairs from different features are optimized at a global level to better leverage the correlation among different channels. In this way, the adaptive approach correlates all the features for a better learning process. Experimental results on real-world datasets demonstrate that our approach outperforms state-of-the-art algorithms at leveraging multiple features for multimedia analysis.
A graph is usually formed to reveal the relationship between data points, and the graph structure is encoded by the affinity matrix. Most graph-based multiview clustering methods use predefined affinity matrices, so the clustering performance highly depends on the quality of the graph. We learn a consensus graph by minimizing disagreement between different views and constraining the rank of the Laplacian matrix. Since diverse views admit the same underlying cluster structure across multiple views, we use a new disagreement cost function for regularizing graphs from different views toward a common consensus. Simultaneously, we impose a rank constraint on the Laplacian matrix to learn a consensus graph with exactly k connected components, where k is the number of clusters, in contrast to the fixed affinity matrices used in most existing graph-based methods. With the learned consensus graph, we can directly obtain the cluster labels without any post-processing, such as the k-means step in spectral clustering-based methods. A multiview consensus clustering method is proposed to learn such a graph, and an efficient iterative updating algorithm is derived to optimize the challenging optimization problem. Experiments on several benchmark datasets demonstrate the effectiveness of the proposed method in terms of seven metrics.
Zhan, K, Niu, C, Chen, C, Nie, F, Zhang, C & Yang, Y 2019, 'Graph Structure Fusion for Multiview Clustering', IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 10, pp. 1984-1993.
Many existing multiview clustering methods take graphs, usually pre-computed independently in each view, as input to uncover the data distribution. These methods ignore the correlation of graph structure among multiple views, and the clustering results highly depend on the quality of the predefined affinity graphs. We address the problem of multiview clustering by seamlessly integrating the graph structures of different views to fully exploit the geometric properties of the underlying data structure. The proposed method is based on the assumption that the intrinsic underlying graph structure assigns the corresponding connected component in each graph to the same cluster. Different graphs from multiple views are integrated using the Hadamard product, since the views usually admit the same underlying structure. Specifically, the graphs are integrated into a global one, and the structure of the global graph is adaptively tuned by a well-designed objective function so that the number of components of the graph is exactly equal to the number of clusters. It is worth noting that we obtain cluster indicators directly from the graph itself without performing further graph-cut or k-means clustering steps. Experiments show the proposed method obtains better clustering performance than state-of-the-art methods.
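A minimal sketch of the fusion step, assuming nonnegative per-view affinity matrices: the Hadamard product keeps an edge only if every view supports it, and cluster labels are read off the connected components. The paper's adaptive tuning that forces exactly as many components as clusters is omitted.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def hadamard_fusion_clusters(affinity_list):
    """Fuse per-view affinity graphs elementwise, then read cluster labels
    directly off the connected components of the fused graph."""
    fused = np.ones_like(affinity_list[0])
    for A in affinity_list:                     # A: (n, n) nonnegative affinity
        fused *= A                              # elementwise (Hadamard) product
    n_clusters, labels = connected_components(csr_matrix(fused > 0),
                                              directed=False)
    return n_clusters, labels
```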
Zheng, L, Huang, Y, Lu, H & Yang, Y 2019, 'Pose Invariant Embedding for Deep Person Re-identification', IEEE Transactions on Image Processing, vol. 28, no. 9, pp. 4500-4509.
Pedestrian misalignment, which mainly arises from detector errors and pose variations, is a critical problem for a robust person re-identification (re-ID) system. With poor alignment, the feature learning and matching process might be largely compromised. To address this problem, this paper introduces the pose invariant embedding (PIE) as a pedestrian descriptor. First, in order to align pedestrians to a standard pose, the PoseBox structure is introduced, which is generated through pose estimation followed by affine transformations. Second, to reduce the impact of pose estimation errors and information loss during PoseBox construction, we design a PoseBox fusion (PBF) CNN architecture that takes the original image, the PoseBox, and the pose estimation confidence as input. The proposed PIE descriptor is thus defined as the fully connected layer of the PBF network for the retrieval task. Experiments are conducted on the Market-1501, CUHK03-NP, and DukeMTMC-reID datasets. We show that PoseBox alone yields decent re-ID accuracy, and that when integrated in the PBF network, the learned PIE descriptor produces competitive performance compared with the state-of-the-art approaches.
Zheng, Z, Zheng, L & Yang, Y 2019, 'Pedestrian Alignment Network for Large-scale Person Re-identification', IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 10, pp. 3037-3045.
Person re-identification (re-ID) is mostly viewed as an image retrieval problem: the task is to search for a query person in a large image pool. In practice, person re-ID usually relies on automatic detectors to obtain cropped pedestrian images, but this process suffers from two types of detector errors: excessive background and missing parts. Both errors deteriorate the quality of pedestrian alignment and may compromise pedestrian matching due to position and scale variances. To address the misalignment problem, we propose that alignment be learned from an identification procedure. We introduce the pedestrian alignment network (PAN), which allows discriminative embedding learning and pedestrian alignment without extra annotations. We observe that when a convolutional neural network (CNN) learns to discriminate between different identities, the learned feature maps usually exhibit strong activations on the human body rather than the background. The proposed network takes advantage of this attention mechanism to adaptively locate and align pedestrians within a bounding box. Visual examples show that pedestrians are better aligned with PAN. Experiments on three large-scale re-ID datasets confirm that PAN improves the discriminative ability of the feature embeddings and yields competitive accuracy compared with state-of-the-art methods.
Zhong, Z, Zheng, L, Zheng, Z, Li, S & Yang, Y 2019, 'CamStyle: A Novel Data Augmentation Method for Person Re-Identification', IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1176-1190.
Person re-identification (re-ID) is a cross-camera retrieval task that suffers from image style variations caused by different cameras. Prior art implicitly addresses this problem by learning a camera-invariant descriptor subspace. In this paper, we explicitly consider this challenge by introducing camera style (CamStyle). CamStyle serves as a data augmentation approach that reduces the risk of deep network overfitting and smooths the camera style disparities. Specifically, with a style transfer model, labeled training images can be style-transferred to each camera and, together with the original training samples, form the augmented training set. This method, while increasing data diversity against overfitting, also introduces a considerable level of noise. To alleviate the impact of noise, label smooth regularization (LSR) is adopted. The vanilla version of our method (without LSR) performs reasonably well on systems with few cameras, in which overfitting often occurs; with LSR, we demonstrate consistent improvement in all systems regardless of the extent of overfitting. We also report competitive accuracy compared with the state of the art on Market-1501 and DukeMTMC-reID. Importantly, CamStyle can be applied to the challenging problems of one-view learning and unsupervised domain adaptation (UDA) in person re-ID, both of which have critical research and application significance: the former has labeled data in only one camera view, and the latter has labeled data only in the source domain. Experimental results show that CamStyle significantly improves the performance of the baseline in both problems. Specifically, for UDA, CamStyle achieves state-of-the-art accuracy based on a baseline deep re-ID model on Market-1501 and DukeMTMC-reID. Our code is available at: https://github.com/zhunzhong07/CamStyle.
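A minimal sketch of label smooth regularization as used above to absorb style-transfer noise: the one-hot target is mixed with a uniform distribution over classes, with the smoothing weight `epsilon` as an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def label_smooth_ce(logits, targets, epsilon=0.1):
    """Cross-entropy against a smoothed target: (1 - epsilon) on the true
    class plus epsilon spread uniformly, so noisy style-transferred samples
    cannot force overconfident predictions."""
    n_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    smooth = torch.full_like(log_probs, epsilon / n_classes)
    smooth.scatter_(1, targets.unsqueeze(1), 1.0 - epsilon + epsilon / n_classes)
    return -(smooth * log_probs).sum(dim=1).mean()
```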
Du, X, Yin, H, Huang, Z, Yang, Y & Zhou, X 2018, 'Exploiting detected visual objects for frame-level video filtering', World Wide Web, vol. 21, no. 5, pp. 1259-1284.
© 2017 Springer Science+Business Media, LLC. Videos are generated at an unprecedented speed on the web. To improve the efficiency of access, developing new ways to filter videos has become a popular research topic. One ongoing direction is to use visual objects to perform frame-level video filtering. Under this direction, existing works create a unique-object table and an occurrence table to maintain the connections between videos and objects. However, the creation process is neither scalable nor dynamic because it heavily depends on human labeling. To improve this, we propose to use detected visual objects to create these two tables for frame-level video filtering. Our study begins by investigating existing object detection techniques, and we find that object detection alone lacks the identification and connection abilities to accomplish the creation process. To supply these abilities, we further investigate three candidates, namely recognizing-based, matching-based, and tracking-based methods, to work with the object detection. By analyzing their mechanisms and evaluating their accuracy, we find that they are imperfect for identifying or connecting the visual objects. Accordingly, we propose a novel hybrid method that combines the matching-based and tracking-based methods to overcome these limitations. Our experiments show that the proposed method achieves higher accuracy and efficiency than the candidate methods, and the subsequent analysis shows that it can efficiently support frame-level video filtering using visual objects.
Fan, H, Zheng, L, Yan, C & Yang, Y 2018, 'Unsupervised Person Re-identification: Clustering and Fine-tuning', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 14, no. 4.
Hu, Y, Zheng, L, Yang, Y & Huang, Y 2018, 'Twitter100k: A Real-World Dataset for Weakly Supervised Cross-Media Retrieval', IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 927-938.
© 2017 IEEE. This paper contributes a new large-scale dataset for weakly supervised cross-media retrieval, named Twitter100k. Current datasets, such as Wikipedia, NUS-WIDE, and Flickr30k, have two major limitations. First, these datasets lack content diversity, i.e., only some predefined classes are covered. Second, texts in these datasets are written in well-organized language, which is inconsistent with realistic applications. To overcome these drawbacks, the proposed Twitter100k dataset is characterized by two aspects: it has 100,000 image-text pairs randomly crawled from Twitter and thus has no constraint on the image categories, and its text is written in informal language by the users. Since strongly supervised methods leverage class labels that may be missing in practice, this paper focuses on weakly supervised learning for cross-media retrieval, in which only text-image pairs are exploited during training. We extensively benchmark the performance of four subspace learning methods and three variants of the correspondence AutoEncoder, along with various text features, on Wikipedia, Flickr30k, and Twitter100k. As a minor contribution, we also design a deep neural network to learn cross-modal embeddings for Twitter100k. Inspired by the characteristics of Twitter100k, we propose a method to integrate optical character recognition into cross-media retrieval. The experimental results show that the proposed method improves the baseline performance.
Li, Z, Nie, F, Chang, X, Nie, L, Zhang, H & Yang, Y 2018, 'Rank-Constrained Spectral Clustering With Flexible Embedding', IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 6073-6082.
Spectral clustering (SC) has been proven effective in various applications. However, the learning scheme of SC is suboptimal in that it learns the cluster indicator from a fixed graph structure, which usually requires a rounding procedure to further partition the data. Moreover, the obtained number of clusters may not reflect the ground-truth number of connected components in the graph. To alleviate these drawbacks, we propose a rank-constrained SC framework with flexible embedding. Specifically, an adaptive probabilistic neighborhood learning process is employed to recover the block-diagonal affinity matrix of an ideal graph. Meanwhile, a flexible embedding scheme is learned to unravel the intrinsic cluster structure in a low-dimensional subspace, where the irrelevant information and noise in the high-dimensional data have been effectively suppressed. The proposed method is superior to previous SC methods in that: 1) the block-diagonal affinity matrix, learned simultaneously with the adaptive graph construction process, induces the cluster membership more explicitly without further discretization; 2) the number of clusters is guaranteed to converge to the ground truth via a rank constraint on the Laplacian matrix; and 3) the mismatch between the embedded feature and the projected feature allows more freedom for finding the proper cluster structure in the low-dimensional subspace as well as for learning the corresponding projection matrix. Experimental results on both synthetic and real-world data sets demonstrate the promising performance of the proposed algorithm.
Liu, R, Wei, S, Zhao, Y & Yang, Y 2018, 'Indexing of the CNN features for the large scale image search', Multimedia Tools and Applications, vol. 77, no. 24, pp. 32107-32131.
© 2018, Springer Science+Business Media, LLC, part of Springer Nature. Convolutional neural network (CNN) features give a good description of image content and usually represent an image with a single feature vector. Although CNN features are more compact than local descriptors, they still cannot efficiently handle large-scale retrieval, because the cost of computation and storage grows linearly with the database. To address this issue, we build a simple but effective indexing framework on an inverted table, which significantly decreases both search time and memory usage. First, several strategies are investigated to adapt the inverted table to CNN features and compensate for quantization error: we use multiple assignment for the query and database images to increase the probability that relevant images are assigned to the same visual word obtained via clustering, and embedding codes are introduced to improve retrieval accuracy by removing false matches. Second, a novel indexing framework that combines the inverted table with hashing codes is proposed; this framework is faster than the reformed inverted tables with the introduced strategies. Experiments on several benchmark datasets demonstrate that our method yields faster retrieval than brute-force search. We also provide a fair comparison between popular CNN features.
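A minimal sketch of an inverted table with query-side multiple assignment, assuming the visual-word centroids have already been learned by clustering; the embedding codes and the hashing variant described above are omitted.

```python
import numpy as np

def build_inverted_table(db_feats, centroids):
    """Quantize each database vector to its nearest visual word and store
    its id in that word's posting list."""
    words = np.argmin(np.linalg.norm(
        db_feats[:, None, :] - centroids[None, :, :], axis=2), axis=1)
    table = {}
    for i, w in enumerate(words):
        table.setdefault(int(w), []).append(i)
    return table

def search(query, centroids, table, db_feats, n_assign=3, topk=10):
    """Multiple assignment on the query side: probe the n_assign nearest
    words to reduce quantization error, then rank only those candidates."""
    near_words = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_assign]
    candidates = sorted({i for w in near_words for i in table.get(int(w), [])})
    if not candidates:
        return np.array([], dtype=int)
    scores = db_feats[candidates] @ query      # dot-product ranking
    return np.asarray(candidates)[np.argsort(-scores)[:topk]]
```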
Liu, W, Chang, X, Yan, Y, Yang, Y & Hauptmann, AG 2018, 'Few-shot text and image classification via analogical transfer learning', ACM Transactions on Intelligent Systems and Technology, vol. 9, no. 6.
© 2018 ACM. Learning from very few samples is a challenge for machine learning tasks, such as text and image classification. Performance on such tasks can be enhanced via transfer of helpful knowledge from related domains, which is referred to as transfer learning. In previous transfer learning works, instance transfer learning algorithms mostly focus on selecting the source domain instances similar to the target domain instances for transfer. However, the selected instances usually do not directly contribute to the learning performance in the target domain. Hypothesis transfer learning algorithms focus on model/parameter-level transfer: they treat the source hypotheses as well-trained and transfer their knowledge, in terms of parameters, to learn the target hypothesis. Such algorithms directly optimize the target hypothesis by the observable performance improvements. However, they fail to consider that instances contributing to the source hypotheses may be harmful for the target hypothesis, as the analysis of instance transfer learning shows. To relieve the aforementioned problems, we propose a novel transfer learning algorithm that follows an analogical strategy. Particularly, the proposed algorithm first learns a revised source hypothesis using only the instances that contribute to the target hypothesis. Then, it transfers both the revised source hypothesis and the target hypothesis (trained with only a few samples) to learn an analogical hypothesis. We denote our algorithm Analogical Transfer Learning. Extensive experiments on one synthetic dataset and three real-world benchmark datasets demonstrate the superior performance of the proposed algorithm.
Luo, M, Chang, X, Nie, L, Yang, Y, Hauptmann, AG & Zheng, Q 2018, 'An Adaptive Semisupervised Feature Analysis for Video Semantic Recognition', IEEE Transactions on Cybernetics, vol. 48, no. 2, pp. 648-660.
© 2017 IEEE. Video semantic recognition usually suffers from the curse of dimensionality and the absence of enough high-quality labeled instances, so semisupervised feature selection has gained increasing attention for its efficiency and comprehensibility. Most previous methods assume that videos with close distance (neighbors) have similar labels and characterize the intrinsic local structure through a predetermined graph of both labeled and unlabeled data. However, besides the parameter-tuning problem underlying the construction of the graph, the affinity measurement in the original feature space usually suffers from the curse of dimensionality. Additionally, the predetermined graph is separated from the procedure of feature selection, which might lead to degraded performance for video semantic recognition. In this paper, we exploit a novel semisupervised feature selection method from a new perspective. The primary assumption underlying our model is that instances with similar labels should have a larger probability of being neighbors. Instead of using a predetermined similarity graph, we incorporate the exploration of the local structure into the procedure of joint feature selection so as to learn the optimal graph simultaneously. Moreover, an adaptive loss function is exploited to measure the label fitness, which significantly enhances the model's robustness to videos with either a small or a substantial loss. We propose an efficient alternating optimization algorithm to solve the proposed problem, together with analyses of its convergence and computational complexity in theory. Finally, extensive experimental results on benchmark datasets illustrate the effectiveness and superiority of the proposed approach on video semantic recognition related tasks.
Luo, M, Nie, F, Chang, X, Yang, Y, Hauptmann, AG & Zheng, Q 2018, 'Adaptive Unsupervised Feature Selection with Structure Regularization', IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 4, pp. 944-956.
© 2012 IEEE. Feature selection is one of the most important dimension reduction techniques for its efficiency and interpretability. Since practical large-scale data are usually collected without labels, and labeling them is dramatically expensive and time-consuming, unsupervised feature selection has become a ubiquitous and challenging problem. Without label information, the fundamental problem of unsupervised feature selection lies in how to characterize the geometric structure of the original feature space and produce a faithful feature subset that preserves the intrinsic structure accurately. In this paper, we characterize the intrinsic local structure by an adaptive reconstruction graph and simultaneously consider its multiconnected-component (multicluster) structure by imposing a rank constraint on the corresponding Laplacian matrix. To achieve a desirable feature subset, we learn the optimal reconstruction graph and the selection matrix simultaneously, instead of using a predetermined graph. We exploit an efficient alternating optimization algorithm to solve the proposed problem, together with theoretical analyses of its convergence and computational complexity. Finally, extensive experiments on the clustering task are conducted over several benchmark data sets to verify the effectiveness and superiority of the proposed unsupervised feature selection algorithm.
Wang, H, Wu, F, Lu, W, Yang, Y, Li, X, Li, X & Zhuang, Y 2018, 'Identifying Objective and Subjective Words via Topic Modeling', IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, vol. 29, no. 3, pp. 718-730.View/Download from: Publisher's site
Zeng, Z, Li, Z, Cheng, D, Zhang, H, Zhan, K & Yang, Y 2018, 'Two-Stream Multi-Rate Recurrent Neural Network for Video-Based Pedestrian Re-Identification', IEEE Transactions on Industrial Informatics.View/Download from: Publisher's site
IEEE Video-based pedestrian re-identification is a fundamental task in video surveillance and real-world applications, and it has attracted much research attention recently. Its goal is to match pedestrians across multiple non-overlapping network cameras. In this paper, we propose a novel two-stream multi-rate recurrent neural network for video-based pedestrian re-identification, which has two inherent benefits: (1) it captures both static spatial and temporal information; and (2) it copes with variance in motion speed. Given video sequences of pedestrians, we start by extracting spatial and motion features using two different deep neural networks. Then we combine them using a regularized fusion network, which aims to explore feature correlations. Going a step further, we feed the two features into a multi-rate recurrent network to exploit the temporal correlations and, more importantly, to take into consideration that pedestrians, sometimes even the same pedestrian, move at different speeds across different camera views. Extensive experiments have been conducted on two real-world video-based pedestrian re-identification benchmarks: the iLIDS-VID and PRID 2011 datasets. The experimental results confirm the superiority of the proposed method. Our code will be released upon acceptance.
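A minimal PyTorch sketch of the two-stream, multi-rate idea described above; the linear fusion layer, hidden size, and rate set (1, 2, 4) are assumptions for illustration, not the authors' released architecture:

    import torch
    import torch.nn as nn

    class TwoStreamMultiRateRNN(nn.Module):
        """Illustrative sketch: spatial and motion streams are fused,
        then LSTMs run over the sequence at several temporal rates to
        tolerate motion-speed variance across camera views."""
        def __init__(self, spat_dim, mot_dim, hid=128, rates=(1, 2, 4)):
            super().__init__()
            self.fuse = nn.Linear(spat_dim + mot_dim, hid)  # fusion layer
            self.rates = rates
            self.rnns = nn.ModuleList(
                nn.LSTM(hid, hid, batch_first=True) for _ in rates)

        def forward(self, spatial, motion):
            # spatial, motion: (batch, time, feat_dim)
            x = torch.relu(self.fuse(torch.cat([spatial, motion], dim=-1)))
            feats = []
            for rate, rnn in zip(self.rates, self.rnns):
                _, (h, _) = rnn(x[:, ::rate])   # subsample the time axis
                feats.append(h[-1])             # last hidden state per rate
            return torch.cat(feats, dim=-1)     # sequence descriptor

    seq = TwoStreamMultiRateRNN(512, 256)(
        torch.randn(2, 16, 512), torch.randn(2, 16, 256))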
Zhang, S, Yang, Y, Xiao, J, Liu, X, Yang, Y, Xie, D & Zhuang, Y 2018, 'Fusing geometric features for skeleton-based action recognition using multilayer LSTM Networks', IEEE Transactions on Multimedia, vol. 20, no. 9, pp. 2330-2343.View/Download from: Publisher's site
© 1999-2012 IEEE. Recent skeleton-based action recognition approaches achieve great improvement by using recurrent neural network (RNN) models. Currently, these approaches build an end-to-end network from joint coordinates to class categories and improve accuracy by extending the RNN to spatial domains. First, while such well-designed models and optimization strategies explore relations between different parts directly from joint coordinates, we provide a simple, universal spatial modeling method perpendicular to RNN model enhancement. Specifically, following the evolution of previous work, we select a set of simple geometric features and then separately feed each type of feature to a three-layer LSTM framework. Second, we propose a multistream LSTM architecture with a new smoothed score fusion technique to learn classification from different geometric feature streams. Furthermore, we observe that the geometric relational features based on distances between joints and selected lines outperform other features, and the fusion results achieve state-of-the-art performance on four datasets. We also show the sparsity of input gate weights in the first LSTM layer trained by geometric features and demonstrate that using joint-line distances as input requires less data for training.
Zheng, L, Yang, Y & Tian, Q 2018, 'SIFT Meets CNN: A Decade Survey of Instance Retrieval', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 5, pp. 1224-1244.View/Download from: Publisher's site
In the early days, content-based image retrieval (CBIR) was studied with global features. Since 2003, image retrieval based on local descriptors (de facto SIFT) has been extensively studied for over a decade due to the advantage of SIFT in dealing with image transformations. Recently, image representations based on the convolutional neural network (CNN) have attracted increasing interest in the community and demonstrated impressive performance. Given this time of rapid evolution, this article provides a comprehensive survey of instance retrieval over the last decade. Two broad categories, SIFT-based and CNN-based methods, are presented. For the former, according to the codebook size, we organize the literature into using large/medium-sized/small codebooks. For the latter, we discuss three lines of methods, i.e., using pre-trained or fine-tuned CNN models, and hybrid methods. The first two perform a single pass of an image through the network, while the last category employs a patch-based feature extraction scheme. This survey presents milestones in modern instance retrieval, reviews a broad selection of previous works in different categories, and provides insights on the connection between SIFT- and CNN-based methods. After analyzing and comparing the retrieval performance of the different categories on several datasets, we discuss promising directions towards generic and specialized instance retrieval.
Zheng, L, Yang, Y & Zheng, Z 2018, 'A Discriminatively Learned CNN Embedding for Person Re-identification', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 14, no. 1, pp. 1-20.View/Download from: Publisher's site
In this article, we revisit two popular convolutional neural networks for person re-identification (re-ID): verification and identification models. The two models have their respective advantages and limitations due to their different loss functions. Here, we shed light on how to combine the two models to learn more discriminative pedestrian descriptors. Specifically, we propose a Siamese network that simultaneously computes the identification loss and the verification loss. Given a pair of training images, the network predicts the identities of the two input images and whether they belong to the same identity. Our network learns a discriminative embedding and a similarity measurement at the same time, thus making full use of the re-ID annotations. Our method can easily be applied to different pretrained networks. Albeit simple, the learned embedding improves the state-of-the-art performance on two public person re-ID benchmarks. Further, we show that our architecture can also be applied to image retrieval. The code is available at https://github.com/layumi/2016_person_re-ID.
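A minimal PyTorch sketch of the joint identification plus verification idea: a shared embedding feeds an identity classifier, while a same/different head operates on the squared feature difference of the pair. The linear backbone and all sizes are placeholders, not the released model:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class IDVerifNet(nn.Module):
        """Sketch of a Siamese network computing both losses; the
        backbone and dimensions are illustrative assumptions."""
        def __init__(self, feat_dim=128, num_ids=751):
            super().__init__()
            self.embed = nn.Sequential(nn.Linear(2048, feat_dim), nn.ReLU())
            self.id_head = nn.Linear(feat_dim, num_ids)   # identification
            self.verif_head = nn.Linear(feat_dim, 2)      # same / different

        def forward(self, xa, xb):
            fa, fb = self.embed(xa), self.embed(xb)
            diff = (fa - fb) ** 2                  # squared feature difference
            return self.id_head(fa), self.id_head(fb), self.verif_head(diff)

    net = IDVerifNet()
    xa, xb = torch.randn(4, 2048), torch.randn(4, 2048)
    ida = torch.randint(0, 751, (4,))
    idb = torch.randint(0, 751, (4,))
    same = torch.randint(0, 2, (4,))
    la, lb, lv = net(xa, xb)
    # Joint objective: two identification losses plus one verification loss.
    loss = F.cross_entropy(la, ida) + F.cross_entropy(lb, idb) + F.cross_entropy(lv, same)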
Li, Z, Nie, F, Chang, X, Yang, Y, Zhang, C & Sebe, N 2018, 'Dynamic Affinity Graph Construction for Spectral Clustering Using Multiple Features.', IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 6323-6332.View/Download from: Publisher's site
Spectral clustering (SC) has been widely applied to various computer vision tasks, where the key is to construct a robust affinity matrix for data partitioning. With the increase in visual features, conventional SC methods face two challenges: 1) how to effectively generate an affinity matrix based on multiple features, and 2) how to deal with high-dimensional visual features that could be redundant. To address these issues, we present a new approach that: 1) learns a robust affinity matrix using multiple features, allowing us to simultaneously determine optimal weights for each feature; and 2) determines a set of optimal projection matrices, one for each feature, that define the lower-dimensional space, as well as the optimal affinity weight of each data pair in that space. There are two major advantages of our new approach over existing clustering techniques. First, our approach assigns affinity weights on a per-data-pair basis. The learning procedure avoids the explicit specification of the neighborhood size in the affinity matrix and of the bandwidth parameter required to compute the Gaussian kernel, both of which are sensitive and yet difficult to determine beforehand. Second, the affinity weights are based on distances in a lower-dimensional space, while the low-dimensional space is inferred according to the optimized affinity weights. Both variables are jointly optimized so as to leverage mutual benefits. Experimental results show that the proposed method outperforms the compared alternatives, indicating that it is effective in simultaneously learning the affinity graph and fusing features, resulting in better clustering.
Chang, X & Yang, Y 2017, 'Semisupervised Feature Analysis by Mining Correlations Among Multiple Tasks', IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2294-2305.View/Download from: Publisher's site
In this paper, we propose a novel semisupervised feature selection framework that mines correlations among multiple tasks, and we apply it to different multimedia applications. Instead of independently computing the importance of features for each task, our algorithm leverages shared knowledge from multiple related tasks, thus improving the performance of feature selection. Note that the proposed algorithm is built upon the assumption that different tasks share some common structures. The proposed algorithm selects features in a batch mode, by which the correlations between various features are taken into consideration. Besides, considering that labeling a large amount of training data in the real world is both time-consuming and tedious, we adopt manifold learning, which exploits both labeled and unlabeled training data for feature space analysis. Since the objective function is nonsmooth and difficult to solve, we propose an iterative algorithm with fast convergence. Extensive experiments on different applications demonstrate that our algorithm outperforms other state-of-the-art feature selection algorithms.
Chang, X, Ma, Z, Lin, M, Yang, Y & Hauptmann, AG 2017, 'Feature Interaction Augmented Sparse Learning for Fast Kinect Motion Detection', IEEE Transactions on Image Processing, vol. 26, no. 8, pp. 3911-3920.View/Download from: Publisher's site
© 2017 IEEE. Kinect sensing devices have been widely used in current human-computer interaction entertainment. A fundamental issue involved is to detect users' motions accurately and quickly. In this paper, we tackle it by proposing a linear algorithm augmented by feature interaction. The linear property guarantees speed, whereas feature interaction captures the higher-order effect of the data to enhance accuracy. The Schatten-p norm is leveraged to integrate the main linear effect and the higher-order nonlinear effect by mining the correlation between them. The resulting classification model is a desirable combination of speed and accuracy. We propose a novel solution to solve our objective function. Experiments are performed on three public Kinect-based entertainment data sets related to fitness and gaming. The results show that our method has an advantage for motion detection in a real-time Kinect entertainment environment.
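For reference, the Schatten-p norm used above is the l_p norm of a matrix's singular values; p = 1 gives the nuclear norm and p = 2 the Frobenius norm. A small numpy illustration:

    import numpy as np

    def schatten_p_norm(M, p):
        """Schatten-p norm: the l_p norm of the singular values
        (p=1 is the nuclear norm, p=2 the Frobenius norm)."""
        s = np.linalg.svd(M, compute_uv=False)
        return (s ** p).sum() ** (1.0 / p)

    M = np.random.randn(5, 3)
    print(schatten_p_norm(M, 1))  # nuclear norm
    print(np.isclose(schatten_p_norm(M, 2), np.linalg.norm(M, 'fro')))  # True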
Chang, X, Ma, Z, Yang, Y, Zeng, Z & Hauptmann, AG 2017, 'Bi-level semantic representation analysis for multimedia event detection', IEEE Transactions on Cybernetics, vol. 47, no. 5, pp. 1180-1197.View/Download from: Publisher's site
© 2013 IEEE. Multimedia event detection has been one of the major endeavors in video event analysis, and a variety of approaches have been proposed recently to tackle this problem. Among others, semantic representation has been credited for its promising performance and desirable support for human-understandable reasoning. To generate a semantic representation, we usually utilize several external image/video archives and apply the concept detectors trained on them to the event videos. Due to the intrinsic differences between these archives, the resulting representations presumably have different predictive capabilities for a certain event. Notwithstanding, little work is available for assessing the efficacy of semantic representation at the source level. On the other hand, it is plausible that some concepts are noisy for detecting a specific event. Motivated by these two shortcomings, we propose a bi-level semantic representation analysis method. At the source level, our method learns weights for the semantic representations attained from different multimedia archives. Meanwhile, it restrains the negative influence of noisy or irrelevant concepts at the overall concept level. In addition, we particularly focus on efficient multimedia event detection with few positive examples, which is highly valuable in real-world scenarios. We perform extensive experiments on the challenging TRECVID MED 2013 and 2014 datasets with encouraging results that validate the efficacy of our proposed approach.
Chang, X, Yu, YL, Yang, Y & Xing, EP 2017, 'Semantic Pooling for Complex Event Analysis in Untrimmed Videos', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 8, pp. 1617-1632.View/Download from: Publisher's site
© 1979-2012 IEEE. Pooling plays an important role in generating a discriminative video representation. In this paper, we propose a new semantic pooling approach for challenging event analysis tasks (e.g., event detection, recognition, and recounting) in long untrimmed Internet videos, especially when only a few shots/segments are relevant to the event of interest while many other shots are irrelevant or even misleading. The commonly adopted pooling strategies aggregate the shots indifferently in one way or another, resulting in a great loss of information. Instead, in this work we first define a novel notion of semantic saliency that assesses the relevance of each shot to the event of interest. We then prioritize the shots according to their saliency scores, since shots that are semantically more salient are expected to contribute more to the final event analysis. Next, we propose a new isotonic regularizer that is able to exploit the constructed semantic ordering information. The resulting nearly-isotonic support vector machine classifier exhibits higher discriminative power in event analysis tasks. Computationally, we develop an efficient implementation using the proximal gradient algorithm and derive new, closed-form proximal steps. We conduct extensive experiments on three real-world video datasets and achieve promising improvements.
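A small numpy sketch of the prioritization step described above: shots are ordered by semantic saliency and pooled with a decreasing weight schedule, so salient shots dominate the video vector instead of being averaged away. The decay schedule is an assumption for illustration; the paper's nearly-isotonic SVM is not reproduced here:

    import numpy as np

    def saliency_weighted_pool(shot_feats, saliency):
        """Pool per-shot features with weights that decay along the
        saliency ordering (the 1/(1+rank) decay is an assumption)."""
        order = np.argsort(-saliency)                  # most salient first
        weights = 1.0 / (1.0 + np.arange(len(order)))  # decreasing schedule
        weights /= weights.sum()
        return (weights[:, None] * shot_feats[order]).sum(axis=0)

    feats = np.random.randn(20, 256)  # 20 shots, 256-d shot features
    sal = np.random.rand(20)          # relevance of each shot to the event
    video_vec = saliency_weighted_pool(feats, sal)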
Li, Z, Nie, F, Chang, X & Yang, Y 2017, 'Beyond Trace Ratio: Weighted Harmonic Mean of Trace Ratios for Multiclass Discriminant Analysis', IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, vol. 29, no. 10, pp. 2100-2110.View/Download from: Publisher's site
Luo, M, Nie, F, Chang, X, Yang, Y, Hauptmann, AG & Zheng, Q 2017, 'Avoiding Optimal Mean ℓ2,1-Norm Maximization-Based Robust PCA for Reconstruction.', Neural Computation, vol. 29, no. 4, pp. 1124-1150.View/Download from: Publisher's site
Robust principal component analysis (PCA) is one of the most important dimension-reduction techniques for handling high-dimensional data with outliers. However, most existing robust PCA methods presuppose that the mean of the data is zero and incorrectly use the average of the data as the optimal mean of robust PCA. In fact, this assumption holds only for the squared ℓ2-norm-based traditional PCA. In this letter, we equivalently reformulate the objective of conventional PCA and learn the optimal projection directions by maximizing the sum of the projected differences between each pair of instances based on the ℓ2,1-norm. The proposed method is robust to outliers and also invariant to rotation. More importantly, the reformulated objective not only automatically avoids the calculation of the optimal mean and makes the assumption of centered data unnecessary, but also theoretically connects to the minimization of the reconstruction error. To solve the proposed nonsmooth problem, we exploit an efficient optimization algorithm that softens the contributions from outliers by reweighting each data point iteratively. We theoretically analyze the convergence and computational complexity of the proposed algorithm. Extensive experimental results on several benchmark data sets illustrate the effectiveness and superiority of the proposed method.
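Read from the abstract, the reformulated objective is plausibly of the following form (a reconstruction, not copied from the paper): the sum of projected pairwise distances is maximized over orthonormal projections W,

    \max_{W^\top W = I} \; \sum_{i=1}^{n} \sum_{j=1}^{n} \big\lVert W^\top (x_i - x_j) \big\rVert_2 ,

which is an ℓ2,1-norm of the stacked projected differences. Because pairwise differences are invariant to translating the data, no mean estimate is needed, matching the claim above.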
© 1999-2012 IEEE. Complex event detection has been progressively researched in recent years for the broad interest of video indexing and retrieval. To fulfill the purpose of event detection, one needs to train a classifier using both positive and negative examples. Current classifier training treats the negative videos as equally negative. However, we notice that many negative videos resemble the positive videos to different degrees. Intuitively, we may capture more informative cues from the negative videos if we assign them fine-grained labels, thus benefiting the classifier learning. Aiming for this, we use a statistical method on both the positive and negative examples to get the decisive attributes of a specific event. Based on these decisive attributes, we assign fine-grained labels to the negative examples to treat them differently for more effective exploitation. The resulting fine-grained labels may not be optimal for capturing the discriminative cues from the negative videos. Hence, we propose to jointly optimize the fine-grained labels with the classifier learning, which benefits both. Meanwhile, the labels of the positive examples are supposed to remain unchanged, so we additionally introduce a constraint for this purpose. On the other hand, state-of-the-art deep convolutional neural network features are leveraged in our approach to further boost the event detection performance. Extensive experiments on the challenging TRECVID MED 2014 dataset have validated the efficacy of our proposed approach.
Nie, L, Wei, X, Zhang, D, Wang, X, Gao, Z & Yang, Y 2017, 'Data-Driven Answer Selection in Community QA Systems', IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, vol. 29, no. 6, pp. 1186-1198.View/Download from: Publisher's site
Wu, F, Wang, Z, Lu, W, Li, X, Yang, Y, Luo, J & Zhuang, Y 2017, 'Regularized Deep Belief Network for Image Attribute Detection', IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 7, pp. 1464-1477.View/Download from: Publisher's site
© 1991-2012 IEEE. In general, an image attribute is a human-nameable visual property that has a semantic connotation. Appropriate modeling of the intrinsic contextual correlations among attributes plays a fundamental role in attribute detection. In this paper, we consider image attribute detection from the perspective of regularized deep learning. In particular, we propose a regularized deep belief network (rDBN) to perform the image attribute detection task, which is composed of two parts: 1) a detection DBN (dDBN) that models the joint distribution of images and their corresponding attributes, which acts as an attribute detector and 2) a contextual restricted Boltzmann machine that explicitly models the correlations among attributes acting as a regularizer that restraints the output detection result given by the dDBN to meet the contextual prior of attributes. Furthermore, we propose an efficient fine-tuning scheme that can further optimize the performance of the dDBN by backpropagation. Experimental results show that the proposed rDBN obtains improvements over the state-of-the-art methods for attribute detection on the benchmark data sets.
Zhu, L, Xu, Z, Yang, Y & Hauptmann, AG 2017, 'Uncovering the Temporal Context for Video Question Answering', International Journal of Computer Vision, vol. 124, pp. 409-421.View/Download from: Publisher's site
Zhuang, Y, Wang, H, Xiao, J, Wu, F, Yang, Y, Lu, W & Zhang, Z 2017, 'Bag-of-Discriminative-Words (BoDW) Representation via Topic Modeling', IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, vol. 29, no. 5, pp. 977-990.View/Download from: Publisher's site
Chen, L, Li, X, Yang, Y, Kurniawati, H, Sheng, QZ, Hu, HY & Huang, N 2016, 'Personal health indexing based on medical examinations: A data mining approach', Decision Support Systems, vol. 81, pp. 54-65.View/Download from: Publisher's site
We design a method called MyPHI that predicts personal health index (PHI), a new evidence-based health indicator to explore the underlying patterns of a large collection of geriatric medical examination (GME) records using data mining techniques. We define PHI as a vector of scores, each reflecting the health risk in a particular disease category. The PHI prediction is formulated as an optimization problem that finds the optimal soft labels as health scores based on medical records that are infrequent, incomplete, and sparse. Our method is compared with classification models commonly used in medical applications. The experimental evaluation has demonstrated the effectiveness of our method based on a real-world GME data set collected from 102,258 participants.
Gan, C, Yang, Y, Zhu, L, Zhao, D & Zhuang, Y 2016, 'Recognizing an Action Using Its Name: A Knowledge-Based Approach', International Journal of Computer Vision, vol. 120, pp. 61-77.View/Download from: Publisher's site
Han, Y, Yang, Y & Zhou, X 2016, 'Guest editorial: web multimedia semantic inference using multi-cues', WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, vol. 19, no. 2, pp. 177-179.View/Download from: Publisher's site
Wu, F, Fang, H, Li, X, Tang, S, Lu, W, Yang, Y, Zhu, W & Zhuang, Y 2016, 'Aspect Learning for Multimedia Summarization via Nonparametric Bayesian', IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, vol. 26, no. 10, pp. 1931-1942.View/Download from: Publisher's site
Xia, Y, Nie, L, Zhang, L, Yang, Y, Hong, R & Li, X 2016, 'Weakly Supervised Multilabel Clustering and its Applications in Computer Vision', IEEE TRANSACTIONS ON CYBERNETICS, vol. 46, no. 12, pp. 3220-3232.View/Download from: Publisher's site
Yan, Y, Nie, F, Li, W, Gao, C, Yang, Y & Xu, D 2016, 'Image Classification by Cross-Media Active Learning With Privileged Information', IEEE TRANSACTIONS ON MULTIMEDIA, vol. 18, no. 12, pp. 2494-2502.View/Download from: Publisher's site
Yu, SI, Xu, S, Ma, Z, Li, H, Hauptmann, AG, Chang, X, Yang, Y, Meng, D, Lin, M, Lan, Z, Gan, C, Xu, Z, Mao, Z, Li, X, Jiang, L & Du, X 2016, 'Strategies for searching video content with text queries or video examples', ITE Transactions on Media Technology and Applications, vol. 4, no. 3, pp. 227-238.
© 2016 by ITE Transactions on Media Technology and Applications (MTA). The large number of user-generated videos uploaded onto the Internet every day has led to many commercial video search engines, which mainly rely on text metadata for search. However, metadata is often lacking for user-generated videos, making these videos unsearchable by current search engines. Content-based video retrieval (CBVR) tackles this metadata-scarcity problem by directly analyzing the visual and audio streams of each video. CBVR encompasses multiple research topics, including low-level feature design, feature fusion, semantic detector training, and video search/reranking. We present novel strategies in these topics to enhance CBVR in both accuracy and speed under different query inputs, including pure textual queries and query by video examples. Our proposed strategies have been incorporated into our submission for the TRECVID 2014 Multimedia Event Detection evaluation, where our system outperformed other submissions on both text queries and video example queries, thus demonstrating the effectiveness of our proposed approaches.
Automatically predicting human eye fixations is a useful technique that can facilitate many multimedia applications, e.g., image retrieval, action recognition, and photo retargeting. Conventional approaches suffer from two drawbacks. First, psychophysical experiments show that an object-level interpretation of scenes influences eye movements significantly. Most existing saliency models rely on object detectors, and therefore only a few prespecified categories can be discovered. Second, the relative displacement of objects influences their saliency remarkably, but current models cannot describe it explicitly. To solve these problems, this paper proposes weakly supervised fixations prediction, which leverages image labels to improve the accuracy of human fixations prediction. The proposed model hierarchically discovers objects as well as their spatial configurations. Starting from the raw image pixels, we sample superpixels in an image, whereby seamless object descriptors termed object-level graphlets (oGLs) are generated by random walks on the superpixel mosaic. Then, a manifold embedding algorithm is proposed to encode image labels into oGLs, and the response map of each prespecified object is computed accordingly. On the basis of the object-level response map, we propose spatial-level graphlets (sGLs) to model the relative positions among objects. Afterward, eye tracking data is employed to integrate these sGLs for predicting human eye fixations. Thorough experimental results demonstrate the advantage of the proposed method over the state of the art.
Zhang, L, Yang, Y, Nie, F & Shao, L 2016, 'Perception, Aesthetics, and Emotion in Multimedia Quality Modeling Introduction', IEEE MULTIMEDIA, vol. 23, no. 3, pp. 20-22.View/Download from: Publisher's site
Chang, X, Nie, F, Wang, S, Yang, Y, Zhou, X & Zhang, C 2016, 'Compound Rank-k Projections for Bilinear Analysis', IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 7, pp. 1502-1513.View/Download from: Publisher's site
In many real-world applications, data are represented by matrices or high-order tensors. Despite their promising performance, existing two-dimensional discriminant analysis algorithms employ a single projection model to exploit the discriminant information for projection, making the model less flexible. In this paper, we propose a novel Compound Rank-k Projection (CRP) algorithm for bilinear analysis. CRP deals with matrices directly without transforming them into vectors, and it therefore preserves the correlations within the matrix and decreases the computational complexity. Different from the existing two-dimensional discriminant analysis algorithms, the objective function values of CRP increase monotonically. In addition, CRP utilizes multiple rank-k projection models to enable a larger search space in which the optimal solution can be found. In this way, the discriminant ability is enhanced.
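A toy numpy sketch of the bilinear projection idea: each projection pair acts on the matrix sample directly, so no vectorization is needed. How the compound rank-k models are composed and learned here is an assumption for illustration, not the paper's objective:

    import numpy as np

    def compound_bilinear_features(X, Us, Vs):
        # Each (U, V) pair projects the matrix sample from both sides,
        # preserving row/column correlations; features from all models
        # are concatenated (an assumed composition for illustration).
        return np.concatenate([(U.T @ X @ V).ravel() for U, V in zip(Us, Vs)])

    X = np.random.randn(32, 32)  # one matrix-shaped sample
    Us = [np.linalg.qr(np.random.randn(32, 3))[0] for _ in range(4)]  # left factors
    Vs = [np.linalg.qr(np.random.randn(32, 3))[0] for _ in range(4)]  # right factors
    print(compound_bilinear_features(X, Us, Vs).shape)  # (36,) = 4 models x 3*3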
Chang, X, Nie, F, Yang, Y, Zhang, C & Huang, H 2016, 'Convex Sparse PCA for Unsupervised Feature Learning', ACM Transactions on Knowledge Discovery from Data, vol. 11, no. 1, pp. 1-16.View/Download from: Publisher's site
Principal component analysis (PCA) has been widely applied to dimensionality reduction and data preprocessing for different applications in engineering, biology, social science, and the like. Classical PCA and its variants seek linear projections of the original variables to obtain low-dimensional feature representations with maximal variance. One limitation is that the results of PCA are difficult to interpret; besides, classical PCA is vulnerable to certain noisy data. In this paper, we propose a Convex Sparse Principal Component Analysis (CSPCA) algorithm and apply it to feature learning. First, we show that PCA can be formulated as a low-rank regression optimization problem. Based on this discussion, l2,1-norm minimization is incorporated into the objective function to make the regression coefficients sparse and thereby robust to outliers. Also, based on the sparse model used in CSPCA, an optimal weight is assigned to each of the original features, which in turn provides the output with good interpretability. With the output of our CSPCA, we can effectively analyze the importance of each feature under the PCA criteria. Our new objective function is convex, and we propose an iterative algorithm to optimize it. We apply the CSPCA algorithm to feature selection and conduct extensive experiments on seven benchmark datasets. Experimental results demonstrate the effectiveness of the proposed method.
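A small numpy illustration of the l2,1 machinery behind CSPCA-style feature learning; the usage shown (ranking features by the row norms of a coefficient matrix W) is a hypothetical sketch, not the paper's algorithm:

    import numpy as np

    def l21_norm(W):
        """l2,1 norm: the sum of the l2 norms of the rows. Penalizing it
        drives whole rows of W to zero, which is what makes regression
        coefficients feature-sparse in models of this kind."""
        return np.linalg.norm(W, axis=1).sum()

    # Hypothetical usage: rank features by the row norms of a learned
    # coefficient matrix W (d features x k components).
    W = np.random.randn(10, 3)
    W[[2, 5, 7]] = 0.0                  # rows a row-sparse penalty would zero out
    scores = np.linalg.norm(W, axis=1)  # per-feature importance
    print(l21_norm(W), np.argsort(-scores)[:5])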
Han, Y, Yang, Y & Wang, J 2015, 'Guest Editorial: Ad Hoc Web Multimedia Analysis with Limited Supervision', MULTIMEDIA TOOLS AND APPLICATIONS, vol. 74, no. 2, pp. 463-465.View/Download from: Publisher's site
Han, Y, Yang, Y, Wu, F & Hong, R 2015, 'Compact and Discriminative Descriptor Inference Using Multi-Cues', IEEE TRANSACTIONS ON IMAGE PROCESSING, vol. 24, no. 12, pp. 5114-5126.View/Download from: Publisher's site
Han, Y, Yang, Y, Yan, Y, Ma, Z, Sebe, N & Zhou, X 2015, 'Semisupervised Feature Selection via Spline Regression for Video Semantic Recognition', IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, vol. 26, no. 2, pp. 252-264.View/Download from: Publisher's site
Wu, F, Wang, Z, Zhang, Z, Yang, Y, Luo, J, Zhu, W & Zhuang, Y 2015, 'Weakly Semi-Supervised Deep Learning for Multi-Label Image Annotation', IEEE Transactions on Big Data, vol. 1, no. 3, pp. 109-122.View/Download from: Publisher's site
Yan, Y, Yang, Y, Meng, D, Liu, G, Tong, W, Hauptmann, AG & Sebe, N 2015, 'Event Oriented Dictionary Learning for Complex Event Detection', IEEE TRANSACTIONS ON IMAGE PROCESSING, vol. 24, no. 6, pp. 1867-1878.View/Download from: Publisher's site
Yang, Y, Ma, Z, Nie, F, Chang, X & Hauptmann, AG 2015, 'Multi-Class Active Learning by Uncertainty Sampling with Diversity Maximization', International Journal of Computer Vision, vol. 113, no. 2.View/Download from: Publisher's site
As a way to relieve the tedious work of manual annotation, active learning plays an important role in many applications of visual concept recognition. In typical active learning scenarios, the number of labelled data in the seed set is usually small. However, most existing active learning algorithms only exploit the labelled data, and thus often suffer from over-fitting due to the small number of labelled examples. Besides, while much progress has been made in binary-class active learning, little research attention has been focused on multi-class active learning. In this paper, we propose a semi-supervised batch-mode multi-class active learning algorithm for visual concept recognition. Our algorithm exploits the whole active pool to evaluate the uncertainty of the data. Considering that uncertain data are often similar to each other, we propose to make the selected data as diverse as possible, for which we explicitly impose a diversity constraint on the objective function. As a multi-class active learning algorithm, our algorithm is able to exploit uncertainty across multiple classes. An efficient algorithm is used to optimize the objective function. Extensive experiments on action recognition, object classification, scene recognition, and event detection demonstrate its advantages.
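A greedy numpy stand-in for the idea of combining multi-class uncertainty with batch diversity; the paper optimizes a joint objective, whereas this selection rule and its lam trade-off are assumptions for illustration:

    import numpy as np

    def select_batch(probs, feats, k, lam=0.5):
        """Greedily pick k samples that are uncertain (high entropy over
        class probabilities) yet dissimilar to those already chosen."""
        ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # multi-class uncertainty
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        chosen = []
        for _ in range(k):
            if chosen:
                sim = (feats @ feats[chosen].T).max(axis=1)  # closeness to batch
            else:
                sim = np.zeros(len(feats))
            score = ent - lam * sim          # uncertain but diverse
            score[chosen] = -np.inf          # never re-pick a sample
            chosen.append(int(score.argmax()))
        return chosen

    probs = np.random.dirichlet(np.ones(5), size=100)  # softmax outputs, 5 classes
    feats = np.random.randn(100, 64)                   # pool features
    print(select_batch(probs, feats, k=10))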
Yang, Y, Ma, Z, Yang, Y, Nie, F & Shen, HT 2015, 'Multitask Spectral Clustering by Exploring Intertask Correlation', IEEE TRANSACTIONS ON CYBERNETICS, vol. 45, no. 5, pp. 1069-1080.View/Download from: Publisher's site
Han, Y, Wei, X, Cao, X, Yang, Y & Zhou, X 2014, 'Augmenting image descriptions using structured prediction output', IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1665-1676.View/Download from: Publisher's site
© 2014 IEEE. The need for richer descriptions of images arises in a wide spectrum of applications ranging from image understanding to image retrieval. While the Automatic Image Annotation (AIA) has been extensively studied, image descriptions with the output labels lack sufficient information. This paper proposes to augment image descriptions using structured prediction output. We define a hierarchical tree-structured semantic unit to describe images, from which we can obtain not only the class and subclass one image belongs to, but also the attributes one image has. After defining a new feature map function of structured SVM, we decompose the loss function into every node of the hierarchical tree-structured semantic unit and then predict the tree-structured semantic unit for testing images. In the experiments, we evaluate the performance of the proposed method on two open benchmark datasets and compare with the state-of-the-art methods. Experimental results show the better prediction performance of the proposed method and demonstrate the strength of augmenting image descriptions.
Visual attributes can be considered a middle-level semantic cue that bridges the gap between low-level image features and high-level object classes. Thus, attributes have the advantage of transcending specific semantic categories or describing objects across categories. Since attributes are often human-nameable and domain-specific, much work constructs attribute annotations ad hoc or takes them from an application-dependent ontology. To facilitate other applications with attributes, it is necessary to develop methods that can adapt a well-defined set of attributes to novel images. In this paper, we propose a framework for image attribute adaptation. The goal is to automatically adapt the knowledge of attributes from a well-defined auxiliary image set to a target image set, thus assisting in predicting appropriate attributes for target images. In the proposed framework, we use a non-linear mapping function corresponding to multiple base kernels to map each training image of both the auxiliary and target sets to a Reproducing Kernel Hilbert Space (RKHS), where we reduce the mismatch in data distributions between auxiliary and target images. In order to make use of unlabeled images, we incorporate a semi-supervised learning process. We also introduce a robust loss function into our framework to remove the shared irrelevance and noise of the training images. Experiments on two pairs of auxiliary-target image sets demonstrate that the proposed framework predicts attributes for target testing images better than three baselines and two state-of-the-art domain adaptation methods. © 2014 IEEE.
With the advance of the Web 2.0 era came an explosive growth of geographical multimedia data shared on social network websites such as Flickr, YouTube, Facebook, and Zooomr. Location-aware media description, modeling, learning, and recommendation in pervasive social media analytics have become a key focus of the recent research in computer vision, multimedia, and signal processing societies. A new breed of multimedia applications that incorporates image/video annotation, visual search, content mining and recommendation, and so on may revolutionize the field. Combined with the popularity of location-aware social multimedia, location context data makes traditionally challenging problems more tractable. This special issue brings together active researchers to share recent progress in this exciting area. This issue highlights the latest developments in large-scale multiple evidence-based learning for geosocial multimedia computing and identifies several key challenges and potential innovations. © 2014 IEEE.
Li, P, Bu, J, Yang, Y, Ji, R, Chen, C & Cai, D 2014, 'Discriminative Orthogonal Nonnegative matrix factorization with flexibility for data representation', Expert Systems with Applications, vol. 41, no. 4, Part 1, pp. 1283-1293.View/Download from: Publisher's site
Learning an informative data representation is of vital importance in multidisciplinary applications, e.g., face analysis, document clustering, and collaborative filtering. As a very useful tool, Nonnegative matrix factorization (NMF) is often employed to learn a well-structured data representation. While the geometric structure of the data has been studied in some previous NMF variants, existing works typically neglect the discriminant information revealed by the between-class scatter and the total scatter of the data. To address this issue, we present a novel approach named Discriminative Orthogonal Nonnegative matrix factorization (DON), which preserves both the local manifold structure and the global discriminant information simultaneously through manifold discriminant learning. In particular, to learn the discriminant structure for the data representation, we introduce the scaled indicator matrix, which naturally satisfies the orthogonality condition, and we thus impose orthogonality constraints on the objective function. However, constraints that are too strict lead to a very sparse data representation, which is undesirable in practice, so we further make this orthogonality flexible. In addition, we provide the optimization framework with a convergence proof of the updating rules. Extensive comparisons over several state-of-the-art approaches demonstrate the efficacy of the proposed method. © 2013 Elsevier Ltd. All rights reserved.
Li, Z, Liu, J, Yang, Y, Zhou, X & Lu, H 2014, 'Clustering-Guided Sparse Structural Learning for Unsupervised Feature Selection', IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, vol. 26, no. 9, pp. 2138-2150.View/Download from: Publisher's site
Liu, J, Yang, Y, Huang, Z, Yang, Y & Shen, HT 2014, 'On the influence propagation of web videos', IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 8, pp. 1961-1973.View/Download from: Publisher's site
We propose a novel approach to analyze how a popular video propagates in cyberspace, to identify whether it originated from a certain sharing site, and to identify how it reached its current popularity during propagation. In addition, we estimate its influence across different websites outside the major hosting website. Web video is gaining significance due to its rich and eye-catching content, a phenomenon evidently amplified and accelerated by the advance of Web 2.0. When a video receives some degree of popularity, it tends to appear on various websites, including not only video-sharing websites but also news websites, social networks, or even Wikipedia. Numerous video-sharing websites have hosted videos that reached a phenomenal level of visibility and popularity in the entire cyberspace. As a result, it is becoming more difficult to determine how the propagation took place: was the video a piece of original work that was intentionally uploaded to its major hosting site by the authors, did the video originate from some small site and then reach the sharing site after already gaining a good level of popularity, or did it originate from elsewhere in cyberspace while the sharing site made it popular? Existing studies regarding this flow of influence are lacking, and literature discussing the problem of estimating a video's influence in the whole cyberspace also remains rare. In this article we introduce a novel framework to identify the propagation of popular videos from the major hosting site's perspective and to estimate their influence. We define a Unified Virtual Community Space (UVCS) to model the propagation and influence of a video, and devise a novel learning method called Noise-reductive Local-and-Global Learning (NLGL) to effectively estimate a video's origin and influence. Without losing generality, we conduct experiments on an annotated dataset collected from a major video sharing site to evaluate the effectiveness of the framework. Sur...
Liu, J, Zhang, P, Yu, T, Yang, Y & Qiu, H 2014, '[Effects of losartan on pulmonary dendritic cells in lipopolysaccharide- induced acute lung injury mice].', Zhonghua yi xue za zhi, vol. 94, no. 41, pp. 3216-3219.
OBJECTIVE: To assess the effects of losartan on the frequency and phenotype of respiratory dendritic cells (DC) in lipopolysaccharide (LPS)-induced acute lung injury (ALI) mice. METHODS: C57BL/6 mice were randomly divided into three groups: control, ALI, and ALI+losartan. ALI animals received 2 mg/kg of LPS; ALI+losartan animals received 2 mg/kg of LPS and 15 mg/kg of losartan 30 min before the intratracheal injection of LPS; control animals received phosphate-buffered saline (PBS) instead of LPS. Lung wet weight/body weight (LW/BW) was recorded to assess lung injury. Pathological changes were examined under an optical microscope. The frequency and phenotype of pulmonary DC were characterized by flow cytometry. Meanwhile, the levels of IL-6 in lung homogenates were assessed by enzyme-linked immunosorbent assay (ELISA). RESULTS: (1) The LPS-induced rise in LW/BW was partially prevented by pretreatment with losartan. (2) Histologically, widespread alveolar wall thickening caused by edema, severe hemorrhage in the interstitium and alveoli, and marked, diffuse interstitial infiltration of inflammatory cells were observed in the ALI group, whereas losartan effectively attenuated the LPS-induced pulmonary hemorrhage and leukocytic infiltration in the interstitium and alveoli. (3) Meanwhile, the levels of IL-6 in lung tissue were significantly elevated in the LPS-induced ALI mice, yet after pretreatment with losartan the pulmonary level of IL-6 markedly decreased. (4) LPS dosing resulted in a rapid accumulation of DC in lung tissue and an up-regulated expression of CD80 in LPS-induced ALI. In contrast, the expression of MHC II on respiratory DC was not significantly different among groups. Pretreatment with losartan led to a marked reduction in CD80 expression on pulmonary DC (P < 0.05 vs ALI). CONCLUSION: Losartan may attenuate pulmonary injury by inhibiting the activation of pulmonary DC.
Ma, Z, Yang, Y, Nie, F, Sebe, N, Yan, S & Hauptmann, AG 2014, 'Harnessing lab knowledge for real-world action recognition', International Journal of Computer Vision, vol. 109, no. 1-2, pp. 60-73.View/Download from: Publisher's site
Much research on human action recognition has been oriented toward performance gains on lab-collected datasets. Yet real-world videos are more diverse, with more complicated actions, and often only a few of them are precisely labeled; thus, recognizing actions from these videos is a tough mission. The paucity of labeled real-world videos motivates us to "borrow" strength from other resources. Specifically, considering that many lab datasets are available, we propose to harness lab datasets to facilitate action recognition in real-world videos, given that the lab and real-world datasets are related. As their action categories are usually inconsistent, we design a multi-task learning framework to jointly optimize the classifiers for both sides. The general Schatten p-norm is imposed on the two classifiers to explore the shared knowledge between them. In this way, our framework is able to mine the shared knowledge between two datasets even if the two have different action categories, which is a major virtue of our method. The shared knowledge is further used to improve action recognition in the real-world videos. Extensive experiments are performed on real-world datasets with promising results. © 2014 Springer Science+Business Media New York.
Ma, Z, Yang, Y, Sebe, N & Hauptmann, AG 2014, 'Knowledge adaptation with partially shared features for event detection using few exemplars', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 9, pp. 1789-1802.View/Download from: Publisher's site
Multimedia event detection (MED) is an emerging area of research. Previous work mainly focuses on simple event detection in sports and news videos, or abnormality detection in surveillance videos. In contrast, we focus on detecting more complicated and generic events that gain more users' interest, and we explore an effective solution for MED. Moreover, our solution only uses few positive examples since precisely labeled multimedia content is scarce in the real world. As the information from these few positive examples is limited, we propose using knowledge adaptation to facilitate event detection. Different from the state of the art, our algorithm is able to adapt knowledge from another source for MED even if the features of the source and the target are partially different, but overlapping. Avoiding the requirement that the two domains are consistent in feature types is desirable as data collection platforms change or augment their capabilities and we should be able to respond to this with little or no effort. We perform extensive experiments on real-world multimedia archives consisting of several challenging events. The results show that our approach outperforms several other state-of-the-art detection algorithms. © 1979-2012 IEEE.
Mu, Y, Yang, Y, Cao, L, Yan, S & Tian, Q 2014, 'Guest Editorial: Special issue on large scale multimedia semantic indexing', COMPUTER VISION AND IMAGE UNDERSTANDING, vol. 124, pp. 1-2.View/Download from: Publisher's site
Song, J, Yang, Y, Li, X, Huang, Z & Yang, Y 2014, 'Robust Hashing with Local Models for Approximate Similarity Search', IEEE TRANSACTIONS ON CYBERNETICS, vol. 44, no. 7, pp. 1225-1236.View/Download from: Publisher's site
Tong, W, Yang, Y, Jiang, L, Yu, SI, Lan, Z, Ma, Z, Sze, W, Younessian, E & Hauptmann, AG 2014, 'E-LAMP: Integration of innovative ideas for multimedia event detection', Machine Vision and Applications, vol. 25, no. 1, pp. 5-15.View/Download from: Publisher's site
Detecting multimedia events in web videos is an emerging hot research area in the fields of multimedia and computer vision. In this paper, we introduce the core methods and technologies of the framework we developed recently for our Event Labeling through Analytic Media Processing (E-LAMP) system to deal with different aspects of the overall problem of event detection. More specifically, we have developed efficient methods for feature extraction so that we are able to handle large collections of video data with thousands of hours of videos. Second, we represent the extracted raw features in a spatial bag-of-words model with more effective tilings, such that the spatial layout information of different features and different events can be better captured, and thus the overall detection performance can be improved. Third, different from widely used early and late fusion schemes, a novel algorithm is developed to learn a more robust and discriminative intermediate feature representation from multiple features so that better event models can be built upon it. Finally, to tackle the additional challenge of event detection with only very few positive exemplars, we have developed a novel algorithm that is able to effectively adapt the knowledge learnt from auxiliary sources to assist the event detection. Both our empirical results and the official evaluation results on TRECVID MED'11 and MED'12 demonstrate the excellent performance of the integration of these ideas. © 2013 Springer-Verlag Berlin Heidelberg.
Wang, S, Ma, Z, Yang, Y, Li, X, Pang, C & Hauptmann, AG 2014, 'Semi-supervised multiple feature analysis for action recognition', IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 289-298.View/Download from: Publisher's site
This paper presents a semi-supervised method for categorizing human actions using multiple visual features. The proposed algorithm simultaneously learns multiple features from a small number of labeled videos and automatically utilizes data distributions between labeled and unlabeled data to boost the recognition performance. Shared structural analysis is applied in our approach to discover a common subspace shared by each type of feature. In the subspace, the proposed algorithm is able to characterize more discriminative information of each feature type. Additionally, the data distribution information of each type of feature is preserved. These attributes make our algorithm robust for action recognition, especially when only limited labeled training samples are provided. Extensive experiments have been conducted on both choreographed and realistic video datasets, including KTH, YouTube Action, and UCF50. Experimental results show that our method outperforms several state-of-the-art algorithms; most notably, much better performance is achieved when there are only a few labeled training samples. © 1999-2012 IEEE.
Learning hash functions across heterogeneous high-dimensional features is very desirable for many applications involving multi-modal data objects. In this paper, we propose an approach to obtain sparse codesets for data objects across different modalities via joint multi-modal dictionary learning, which we call sparse multi-modal hashing (abbreviated as SM²H). In SM²H, both intra-modality similarity and inter-modality similarity are first modeled by a hypergraph, then multi-modal dictionaries are jointly learned by hypergraph Laplacian sparse coding. Based on the learned dictionaries, the sparse codeset of each data object is acquired and used for multi-modal approximate nearest neighbor retrieval under a sensitive Jaccard metric. The experimental results show that SM²H outperforms other methods in terms of mAP and Percentage on two real-world data sets. © 2013 IEEE.
Yang, Y, Sebe, N, Snoek, C, Hua, X-S & Zhuang, Y 2014, 'Special section on learning from multiple evidences for large scale multimedia analysis', COMPUTER VISION AND IMAGE UNDERSTANDING, vol. 118, pp. 1-1.View/Download from: Publisher's site
Photo cropping is widely used in the printing industry, photography, and cinematography. Conventional photo cropping methods suffer from three drawbacks: 1) the semantics used to describe photo aesthetics are determined by the experience of model designers and specific data sets; 2) image global configurations, an essential cue for capturing photo aesthetics, are not well preserved in the cropped photo; and 3) multi-channel visual features from an image region contribute differently to human aesthetics, but state-of-the-art photo cropping methods cannot automatically weight them. Owing to recent progress in the image retrieval community, image-level semantics, i.e., photo labels obtained without much human supervision, can be efficiently and effectively acquired. Thus, we propose weakly supervised photo cropping, where a manifold embedding algorithm is developed to incorporate image-level semantics and image global configurations with graphlets, i.e., small-sized connected subgraphs. After manifold embedding, a Bayesian Network (BN) is proposed. It incorporates the testing photo into the framework derived from the multi-channel post-embedding graphlets of the training data, the importance of which is determined automatically. Based on the BN, photo cropping can be cast as searching for the candidate cropped photo that maximally preserves graphlets from the training photos, and the optimal cropping parameters are inferred by Gibbs sampling. Subjective evaluations demonstrate that: 1) our approach outperforms several representative photo cropping methods, including our previous cropping model guided by semantics-free graphlets, and 2) the visualized graphlets explicitly capture photo semantics and global spatial configurations. © 1999-2012 IEEE.
Zhang, L, Yang, Y, Gao, Y, Yu, Y, Wang, C & Li, X 2014, 'A probabilistic associative model for segmenting weakly supervised images', IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 4150-4159.View/Download from: Publisher's site
Weakly supervised image segmentation is an important yet challenging task in the image processing and pattern recognition fields. It is defined as follows: in the training stage, semantic labels are given only at the image level, without regard to their specific object/scene locations within the image; given a test image, the goal is to predict the semantics of every pixel/superpixel. In this paper, we propose a new weakly supervised image segmentation model, focusing on learning the semantic associations between superpixel sets (graphlets in this paper). In particular, we first extract graphlets from each image, where a graphlet is a small-sized graph that measures the potential of multiple spatially neighboring superpixels (i.e., the probability of these superpixels sharing a common semantic label, such as the sky or the sea). To compare different-sized graphlets and to incorporate image-level labels, a manifold embedding algorithm is designed to transform all graphlets into equal-length feature vectors. Finally, we present a hierarchical Bayesian network to capture the semantic associations between post-embedding graphlets, based on which the semantics of each superpixel is inferred accordingly. Experimental results demonstrate that: 1) our approach performs competitively compared with state-of-the-art approaches on three public data sets, and 2) considerable performance enhancement is achieved when using our approach in segmentation-based photo cropping and image categorization. © 2014 IEEE.
Cao, X, Wei, X, Han, Y, Yang, Y, Sebe, N & Hauptmann, A 2013, 'Unified Dictionary Learning and Region Tagging with Hierarchical Sparse Representation', COMPUTER VISION AND IMAGE UNDERSTANDING, vol. 117, no. 8, pp. 934-946.View/Download from: Publisher's site
Gao, C, Meng, D, Yang, Y, Wang, Y, Zhou, X & Hauptmann, AG 2013, 'Infrared Patch-Image Model for Small Target Detection in a Single Image', IEEE TRANSACTIONS ON IMAGE PROCESSING, vol. 22, no. 12, pp. 4996-5009.View/Download from: Publisher's site
Liang, Z, Zhuang, Y, Yang, Y & Xiao, J 2013, 'Retrieval-based cartoon gesture recognition and applications via semi-supervised heterogeneous classifiers learning', PATTERN RECOGNITION, vol. 46, no. 1, pp. 412-423.View/Download from: Publisher's site
Ma, Z, Yang, Y, Sebe, N, Zheng, K & Hauptmann, AG 2013, 'Multimedia Event Detection Using A Classifier-Specific Intermediate Representation', IEEE Transactions on Multimedia, vol. 15, no. 7, pp. 1628-1637.View/Download from: Publisher's site
Song, J, Yang, Y, Huang, Z, Shen, HT & Luo, J 2013, 'Effective Multiple Feature Hashing for Large-Scale Near-Duplicate Video Retrieval', IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1997-2008.View/Download from: Publisher's site
Yang, Y, Huang, Z, Yang, Y, Liu, J, Shen, HT & Luo, J 2013, 'Local image tagging via graph regularized joint group sparsity', Pattern Recognition, vol. 46, no. 5, pp. 1358-1368.View/Download from: Publisher's site
Yang, Y, Ma, Z, Hauptmann, AG & Sebe, N 2013, 'Feature Selection for Multimedia Analysis by Sharing Information Among Multiple Tasks', IEEE Transactions on Multimedia, vol. 15, no. 3, pp. 661-669.View/Download from: Publisher's site
Yang, Y, Song, J, Huang, Z, Ma, Z, Sebe, N & Hauptmann, AG 2013, 'Multi-Feature Fusion via Hierarchical Regression for Multimedia Analysis', IEEE Transactions on Multimedia, vol. 15, no. 3, pp. 572-581.View/Download from: Publisher's site
Yang, Y, Yang, Y & Shen, HT 2013, 'Effective Transfer Tagging from Image to Video', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 9, no. 2.View/Download from: Publisher's site
Yang, Y, Yang, Y, Shen, HT, Zhang, Y, Du, X & Zhou, X 2013, 'Discriminative Nonnegative Spectral Clustering with Out-of-Sample Extension', IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 8, pp. 1760-1771.View/Download from: Publisher's site
Zhang, L, Han, Y, Yang, Y, Song, M, Yan, S & Tian, Q 2013, 'Discovering Discriminative Graphlets for Aerial Image Categories Recognition', IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 5071-5084.View/Download from: Publisher's site
Feng, Y, Xiao, J, Zha, Z, Zhang, H & Yang, Y 2012, 'Active learning for social image retrieval using Locally Regressive Optimal Design', Neurocomputing, vol. 95, pp. 54-59.View/Download from: Publisher's site
Liu, Y, Wu, F, Yang, Y, Zhuang, Y & Hauptmann, AG 2012, 'Spline Regression Hashing for Fast Image Search', IEEE Transactions on Image Processing, vol. 21, no. 10, pp. 4480-4491.View/Download from: Publisher's site
Ma, Z, Nie, F, Yang, Y, Uijlings, JRR & Sebe, N 2012, 'Web Image Annotation Via Subspace-Sparsity Collaborated Feature Selection', IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 1021-1030.View/Download from: Publisher's site
Ma, Z, Nie, F, Yang, Y, Uijlings, JRR, Sebe, N & Hauptmann, AG 2012, 'Discriminating Joint Feature Analysis for Multimedia Data Understanding', IEEE Transactions on Multimedia, vol. 14, no. 6, pp. 1662-1672.View/Download from: Publisher's site
Yang, Y, Nie, F, Xu, D, Luo, J, Zhuang, Y & Pan, Y 2012, 'A Multimedia Retrieval Framework Based on Semi-Supervised Ranking and Relevance Feedback', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 723-742.View/Download from: Publisher's site
Yang, Y, Wu, F, Nie, F, Shen, HT, Zhuang, Y & Hauptmann, AG 2012, 'Web and Personal Image Annotation by Mining Label Correlation With Relaxed Visual Graph Embedding', IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 1339-1351.View/Download from: Publisher's site
Zha, Z-J, Wang, M, Zheng, Y-T, Yang, Y, Hong, R & Chua, T-S 2012, 'Interactive Video Indexing With Statistical Active Learning', IEEE Transactions on Multimedia, vol. 14, no. 1, pp. 17-27.View/Download from: Publisher's site
Chen, C, Yang, Y, Nie, F & Odobez, J-M 2011, '3D human pose recovery from image by efficient visual feature selection', Computer Vision and Image Understanding, vol. 115, no. 3, pp. 290-299.View/Download from: Publisher's site
Chen, C, Zhuang, Y, Nie, F, Yang, Y, Wu, F & Xiao, J 2011, 'Learning a 3D Human Pose Distance Metric from Geometric Pose Descriptor', IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 11, pp. 1676-1689.View/Download from: Publisher's site
Pan, H & Yang, Y 2010, 'Combining location and feature information for multimedia retrieval', International Journal of Computer Applications in Technology, vol. 38, no. 1-3, pp. 27-33.View/Download from: Publisher's site
In this paper, we propose a cross-media retrieval method for heterogeneous multimedia data, by which the query examples and the returned results can be of different modalities, e.g., querying images with an example audio clip. Taking multimedia location and content information into consideration, an affinity-propagation-based clustering approach is proposed to analyse and fuse the information carried by co-existing multimedia objects, so as to learn the semantic correlations among the heterogeneous multimedia data and perform cross-media retrieval. We also propose active learning methods for relevance feedback to make the search model more accurate. Copyright © 2010 Inderscience Enterprises Ltd.
Wu, F, Wang, W, Yang, Y, Zhuang, Y & Nie, F 2010, 'Classification by semi-supervised discriminative regularization', Neurocomputing, vol. 73, no. 10-12, pp. 1641-1651.View/Download from: Publisher's site
Yang, Y, Wu, F, Xu, D, Zhuang, Y & Chia, L-T 2010, 'Cross-media retrieval using query dependent search methods', Pattern Recognition, vol. 43, no. 8, pp. 2927-2936.View/Download from: Publisher's site
Yang, Y, Xu, D, Nie, F, Yan, S & Zhuang, Y 2010, 'Image Clustering Using Local Discriminant Models and Global Integration', IEEE Transactions on Image Processing, vol. 19, no. 10, pp. 2761-2773.View/Download from: Publisher's site
Yang, Y, Zhuang, Y, Tao, D, Xu, D, Yu, J & Luo, J 2010, 'Recognizing Cartoon Image Gestures for Retrieval and Interactive Cartoon Clip Synthesis', IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 12, pp. 1745-1756.View/Download from: Publisher's site
In this paper, we propose a new method to recognize gestures of cartoon images with two practical applications, i.e., content-based cartoon image retrieval and interactive cartoon clip synthesis. Upon analyzing the unique properties of four types of features including global color histogram, local color histogram (LCH), edge feature (EF), and motion direction feature (MDF), we propose to employ different features for different purposes and in various phases. We use EF to define a graph and then refine its local structure by LCH. Based on this graph, we adopt a transductive learning algorithm to construct local patches for each cartoon image. A spectral method is then proposed to optimize the local structure of each patch and then align these patches globally. MDF is fused with EF and LCH and a cartoon gesture space is constructed for cartoon image gesture recognition. We apply the proposed method to content-based cartoon image retrieval and interactive cartoon clip synthesis. The experiments demonstrate the effectiveness of our method.
Yang, Y, Guo, T, Zhuang, Y & Wang, W 2009, 'Cross-media retrieval based on synthesis reasoning model', Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, vol. 21, no. 9, pp. 1307-1314.
To gain better cross-media retrieval performance, it is crucial to mine the semantic correlations among heterogeneous multimedia data. In this paper, we adopt the synthesis reasoning model as the underlying mechanism for mining multimedia semantics for cross-media retrieval. We construct the synthesis reasoning sources according to the low-level features of multimedia objects, and the reasoning source intensity field according to the multimedia co-existence information. A series of multimedia semantic spaces are built by a spectral method after synthesis reasoning. Cross-media retrieval is performed on a per-query basis, whereby different retrieval methods are adopted for different queries. Both short-term and long-term relevance feedback are learned, to introduce new multimedia objects that were not in the training set into the multimedia semantic spaces and to refine the reasoning result. Experimental results show that the proposed methods accurately mine the multimedia semantics, and that the cross-media retrieval approach is accurate and stable.
Yang, Y, Zhuang, Y-T, Wu, F & Pan, Y-H 2008, 'Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval', IEEE Transactions on Multimedia, vol. 10, no. 3, pp. 437-446.View/Download from: Publisher's site
Zhuang, YT, Yang, Y & Wu, F 2008, 'Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval', IEEE Transactions on Multimedia, vol. 10, no. 2, pp. 221-229.View/Download from: Publisher's site
Although multimedia objects such as images, audio clips and texts are of different modalities, there are a great amount of semantic correlations among them. In this paper, we propose a transductive learning method to mine the semantic correlations among media objects of different modalities so as to achieve cross-media retrieval. Cross-media retrieval is a new kind of search technology by which the query examples and the returned results can be of different modalities, e.g., querying images with an audio example. First, according to the media objects' features and their co-existence information, we construct a uniform cross-media correlation graph, in which media objects of different modalities are represented uniformly. To perform cross-media retrieval, a positive score is assigned to the query example; the score spreads along the graph, and the media objects of the target modality or multimedia documents (MMDs) with the highest scores are returned. To boost the retrieval performance, we also propose different approaches of long-term and short-term relevance feedback to mine the information contained in positive and negative examples. © 2008 IEEE.
Cai, Y, Yang, Y, Hauptmann, A & Wactlar, H 2015, 'Monitoring and coaching the use of home medical devices' in Briassouli, A, Benois-Pineau, J & Hauptmann, A (eds), Health Monitoring and Personalized Feedback using Multimedia Data, Springer, Germany, pp. 265-283.View/Download from: Publisher's site
© Springer International Publishing Switzerland 2015. Despite the popularity of home medical devices, serious safety concerns have been raised, because use-errors of home medical devices have been linked to a large number of fatal hazards. To address this problem, we introduce a cognitive assistive system to automatically monitor the use of home medical devices. Accurately recognizing user operations is one of the most important functionalities of the proposed system. However, even though various action recognition algorithms have been proposed in recent years, it is still unknown whether they are adequate for recognizing operations in the use of home medical devices. Since the lack of a corresponding database is the main reason for this situation, in the first part of this paper we present a database specially designed for studying the use of home medical devices. We then evaluate the performance of existing approaches on the proposed database. Although we use state-of-the-art approaches that have demonstrated near-perfect performance in recognizing certain general human actions, we observe a significant performance drop when applying them to recognize device operations. We conclude that the tiny actions involved in using the devices are one of the most important reasons for the performance decrease. To accurately recognize tiny actions, it is critical to focus on where the target action happens, namely the region of interest (ROI), and to build more elaborate action models based on the ROI. Therefore, in the second part of this paper, we introduce a simple but effective approach to estimating the ROI for recognizing tiny actions. The key idea of this method is to analyze the correlation between an action and the sub-regions of a frame. The estimated ROI is then used as a filter for building more accurate action representations. Experimental results show significant performance improvements over the baseline methods when using the estimated ROI for action recognition.
Dong, X & Yang, Y 2019, 'One-Shot Neural Architecture Search via Self-Evaluated Template Network', 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, Seoul, Korea (South).View/Download from: Publisher's site
Dong, X & Yang, Y 2019, 'Teacher supervises students how to learn from partially labeled images for facial landmark detection', Proceedings of the IEEE International Conference on Computer Vision, IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea (South), pp. 783-792.View/Download from: Publisher's site
© 2019 IEEE. Facial landmark detection aims to localize the anatomically defined points of human faces. In this paper, we study facial landmark detection from partially labeled facial images. A typical approach is to (1) train a detector on the labeled images; (2) generate new training samples using this detector's predictions as pseudo labels of unlabeled images; (3) retrain the detector on the labeled samples and part of the pseudo-labeled samples. In this way, the detector can learn from both labeled and unlabeled data and become robust. In this paper, we propose an interaction mechanism between a teacher and two students to generate more reliable pseudo labels for unlabeled data, which are beneficial to semi-supervised facial landmark detection. Specifically, the two students are instantiated as dual detectors. The teacher learns to judge the quality of the pseudo labels generated by the students and filters out unqualified samples before the retraining stage. In this way, the student detectors get feedback from their teacher and are retrained on premium data generated by themselves. Since the two students are trained on different samples, a combination of their predictions is more robust as the final prediction than either prediction alone. Extensive experiments on the 300-W and AFLW benchmarks show that the interactions between the teacher and students contribute to better utilization of the unlabeled data and achieve state-of-the-art performance.
Fan, H & Yang, Y 2020, 'Person Tube Retrieval via Language Description', New York.
Feng, Q, Yang, Z, Li, P, Wei, Y & Yang, Y 2019, 'Dual embedding learning for video instance segmentation', Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019, IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), IEEE, Seoul, Korea (South), pp. 717-720.View/Download from: Publisher's site
© 2019 IEEE. In this paper, we propose a novel framework to generate high-quality segmentation results in a two-stage style, aimed at the video instance segmentation task, which requires simultaneous detection, segmentation and tracking of instances. To address this multi-task problem efficiently, we opt to first select high-quality detection proposals in each frame. The categories of the proposals are calibrated with the global context of the video. Then, each selected proposal is extended temporally by a bi-directional Instance-Pixel Dual-Tracker (IPDT), which synchronizes the tracking at both the instance level and the pixel level. The instance-level module concentrates on distinguishing the target instance from other objects, while the pixel-level module focuses more on the local features of the instance. Our proposed method achieved a competitive result of 45.0% mAP on the Youtube-VOS dataset, ranking 3rd in Track 2 of the 2nd Large-scale Video Object Segmentation Challenge.
Li, G, Zhu, L, Liu, P & Yang, Y 2019, 'Entangled transformer for image captioning', Proceedings of the IEEE International Conference on Computer Vision, IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea (South), pp. 8927-8936.View/Download from: Publisher's site
© 2019 IEEE. In image captioning, typical attention mechanisms struggle to identify the corresponding visual signals, especially when predicting highly abstract words. This phenomenon is known as the semantic gap between vision and language. The problem can be alleviated by providing semantic attributes that are homologous to language. Thanks to their inherent recurrent nature and gated operating mechanism, Recurrent Neural Networks (RNNs) and their variants are the dominant architectures in image captioning. However, when designing elaborate attention mechanisms to integrate visual inputs and semantic attributes, RNN-like variants become inflexible due to their complexity. In this paper, we investigate a Transformer-based sequence modeling framework, built only with attention layers and feedforward layers. To bridge the semantic gap, we introduce EnTangled Attention (ETA) that enables the Transformer to exploit semantic and visual information simultaneously. Furthermore, a Gated Bilateral Controller (GBC) is proposed to guide the interactions between the multimodal information. We name our model ETA-Transformer. Remarkably, ETA-Transformer achieves state-of-the-art performance on the MSCOCO image captioning dataset. The ablation studies validate the improvements of our proposed modules.
Luo, Y, Liu, P, Guan, T, Yu, J & Yang, Y 2019, 'Significance-aware information bottleneck for domain adaptive semantic segmentation', Proceedings of the IEEE International Conference on Computer Vision, IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 6777-6786.View/Download from: Publisher's site
© 2019 IEEE. For unsupervised domain adaptation problems, the strategy of aligning the two domains in latent feature space through adversarial learning has achieved much progress in image classification, but usually fails in semantic segmentation tasks, in which the latent representations are overly complex. In this work, we equip the adversarial network with a 'significance-aware information bottleneck (SIB)' to address the above problem. The new network structure, called SIBAN, enables a significance-aware feature purification before the adversarial adaptation, which eases the feature alignment and stabilizes the adversarial training process. On two domain adaptation tasks, i.e., GTA5 -> Cityscapes and SYNTHIA -> Cityscapes, we validate that the proposed method yields leading results compared with other feature-space alternatives. Moreover, SIBAN can even match the state-of-the-art output-space methods in segmentation accuracy, although the latter are often considered to be better choices for domain adaptive segmentation tasks.
Luo, Y, Zheng, L, Guan, T, Yu, J & Yang, Y 2019, 'Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, CA, pp. 2502-2511.View/Download from: Publisher's site
© 2019 IEEE. We consider the problem of unsupervised domain adaptation in semantic segmentation. The key to this task lies in reducing the domain shift, i.e., enforcing the data distributions of the two domains to be similar. A popular strategy is to align the marginal distribution in the feature space through adversarial learning. However, this global alignment strategy does not consider the local, category-level feature distribution. A possible consequence of such global movement is that some categories which are originally well aligned between the source and target may be incorrectly mapped. To address this problem, this paper introduces a category-level adversarial network, aiming to enforce local semantic consistency during the trend of global alignment. Our idea is to take a close look at the category-level data distribution and align each class with an adaptive adversarial loss. Specifically, we reduce the weight of the adversarial loss for category-level aligned features while increasing the adversarial force for those poorly aligned. In this process, we decide how well a feature is category-level aligned between source and target by a co-training approach. On two domain adaptation tasks, i.e., GTA5 -> Cityscapes and SYNTHIA -> Cityscapes, we validate that the proposed method matches the state of the art in segmentation accuracy.
Miao, J, Wu, Y, Liu, P, Ding, Y & Yang, Y 2019, 'Pose-guided feature alignment for occluded person re-identification', Proceedings of the IEEE International Conference on Computer Vision, IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea (South), pp. 542-551.View/Download from: Publisher's site
© 2019 IEEE. Persons are often occluded by various obstacles in person retrieval scenarios. Previous person re-identification (re-id) methods either overlook this issue or resolve it based on an extreme assumption. To alleviate the occlusion problem, we propose to detect the occluded regions and explicitly exclude those regions during feature generation and matching. In this paper, we introduce a novel method named Pose-Guided Feature Alignment (PGFA), exploiting pose landmarks to disentangle the useful information from the occlusion noise. During the feature construction stage, our method utilizes human landmarks to generate attention maps. The generated attention maps indicate whether a specific body part is occluded and guide our model to attend to the non-occluded regions. During matching, we explicitly partition the global feature into parts and use the pose landmarks to indicate which partial features belong to the target person. Only the visible regions are utilized for retrieval. Besides, we construct a large-scale dataset for the occluded person re-ID problem, namely Occluded-DukeMTMC, which is by far the largest dataset for occluded person re-ID. Extensive experiments are conducted on our constructed occluded re-id dataset, two partial re-id datasets, and two commonly used holistic re-id datasets. Our method largely outperforms existing person re-id methods on the three occlusion datasets, while maintaining top performance on the two holistic datasets.
Pan, P, Yan, Y, Yang, T & Yang, Y 2020, 'Learning Discriminators as Energy Networks in Adversarial Learning', New York.
Quan, R, Dong, X, Wu, Y, Zhu, L & Yang, Y 2019, 'Auto-ReID: Searching for a Part-Aware ConvNet for Person Re-Identification', IEEE International Conference on Computer Vision, ICCV, IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea (South).View/Download from: Publisher's site
Prevailing deep convolutional neural networks (CNNs) for person re-IDentification (reID) are usually built upon ResNet or VGG backbones, which were originally designed for classification. Because reID is different from classification, the architecture should be modified accordingly. We propose to automatically search for a CNN architecture that is specifically suitable for the reID task. There are three aspects to be tackled. First, body structural information plays an important role in reID, but it is not encoded in backbones. Second, Neural Architecture Search (NAS) automates the process of architecture design without human effort, but no existing NAS method incorporates the structure information of input images. Third, reID is essentially a retrieval task, but current NAS algorithms are merely designed for classification. To solve these problems, we propose a retrieval-based search algorithm over a specifically designed reID search space, named Auto-ReID. Auto-ReID enables the automated approach to find an efficient and effective CNN architecture for reID. Extensive experiments demonstrate that the searched architecture achieves state-of-the-art performance while reducing parameters by 50% and FLOPs by 53% compared to alternatives.
Rao, Q, Li, G, Yang, Y, Zhang, F & Wang, Z 2020, 'UTS ISA submission at the TRECVID 2019 video to text description task', 2019 TREC Video Retrieval Evaluation, TRECVID 2019.
Copyright © TRECVID 2019. All rights reserved. In this paper, we summarize the technical details of our submission to the TRECVID 2019 video-to-text description task. The main improvements comprise three parts: several efficient and comprehensive high-level features that yield expressive visual feature encodings, algorithms for regulating and optimizing a robust language model, and an expandable strategy for ensembling the well-trained single models. Besides, we conducted a meticulous evaluation of these techniques, and a comprehensive comparison of the experiments indicated their effectiveness for video captioning.
Wang, H, Deng, C, Ma, F & Yang, Y 2020, 'Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries', New York.
Wu, Y, Zhu, L, Yan, Y & Yang, Y 2019, 'Dual Attention Matching for Audio-Visual Event Localization', IEEE International Conference on Computer Vision, ICCV, IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea (South).View/Download from: Publisher's site
In this paper, we investigate the audio-visual event localization problem. The task is to localize an event that is both visible and audible in a video. Previous methods first divide a video into short segments and then fuse visual and acoustic features at the segment level. The duration of these segments is usually short, so the visual and acoustic features of a segment may not be well aligned. Direct concatenation of the two features at the segment level can be vulnerable to a minor temporal misalignment of the two signals. We propose a Dual Attention Matching (DAM) module to cover a longer video duration for better high-level event information modeling, while the local temporal information is attained by a global cross-check mechanism. Our premise is that one should watch the whole video to understand the high-level event, while shorter segments should be checked in detail for localization. Specifically, the global feature of one modality queries the local features of the other modality in a bi-directional way. With temporal co-occurrence encoded between auditory and visual signals, DAM can be readily applied to various audio-visual event localization tasks, e.g., cross-modality localization and supervised event localization. Experiments on the AVE dataset show our method outperforms the state-of-the-art by a large margin.
Yang, Z, Dong, J, Liu, P, Yang, Y & Yan, S 2019, 'Very long natural scenery image prediction by outpainting', Proceedings of the IEEE International Conference on Computer Vision, IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea (South), pp. 10560-10569.View/Download from: Publisher's site
© 2019 IEEE. Compared to image inpainting, image outpainting has received less attention due to two challenges. The first challenge is how to keep spatial and content consistency between the generated images and the original input. The second is how to maintain high quality in the generated results, especially for multi-step generation, in which generated regions are spatially far away from the initial input. To solve these two problems, we devise several innovative modules, named Skip Horizontal Connection and Recurrent Content Transfer, and integrate them into our designed encoder-decoder structure. With this design, our network can generate highly realistic outpainting predictions effectively and efficiently. Beyond that, our method can generate new, very long images while keeping the same style and semantic content as the given input. To test the effectiveness of the proposed architecture, we collect a new scenery dataset with diverse, complicated natural scenes. The experimental results on this dataset demonstrate the efficacy of our proposed network.
Yang, Z, Li, P, Feng, Q, Wei, Y & Yang, Y 2019, 'Going deeper into embedding learning for video object segmentation', Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019, IEEE/CVF International Conference on Computer Vision Workshop, IEEE, Seoul, Korea (South), pp. 697-700.View/Download from: Publisher's site
© 2019 IEEE. In this paper, we investigate the principles of consistent training, between the given reference and the predicted sequence, for better embedding learning in semi-supervised video object segmentation. To accurately segment the target objects given the mask at the first frame, we find that the expected feature embeddings of any consecutive frames should satisfy the following properties: 1) global consistency in terms of both the foreground object(s) and the background; 2) robust local consistency under various object moving rates; 3) environment consistency between the training and inference processes; 4) receptive consistency between the receptive fields of the network and the variable scales of objects; 5) sampling consistency between foreground and background pixels to avoid training bias. With these principles in mind, we carefully design a simple pipeline to effectively lift both the accuracy and efficiency of video object segmentation. With ResNet-101 as the backbone, our single model achieves a J&F score of 81.0% on the validation set of the Youtube-VOS benchmark without any bells and whistles. By applying multi-scale & flip augmentation at the testing stage, the accuracy can be further boosted to 82.4%. Code will be made available.
Zhong, Z, Zheng, L, Kang, G, Li, S & Yang, Y 2020, 'Random Erasing Data Augmentation', New York.
Zhong, Z, Zheng, L, Luo, Z, Li, S & Yang, Y 2019, 'Invariance matters: Exemplar memory for domain adaptive person re-identification', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, CA, USA, pp. 598-607.View/Download from: Publisher's site
© 2019 IEEE. This paper considers the domain adaptive person re-identification (re-ID) problem: learning a re-ID model from a labeled source domain and an unlabeled target domain. Conventional methods mainly aim to reduce the feature distribution gap between the source and target domains. However, these studies largely neglect the intra-domain variations in the target domain, which contain critical factors influencing the testing performance on the target domain. In this work, we comprehensively investigate the intra-domain variations of the target domain and propose to generalize the re-ID model w.r.t. three types of underlying invariance, i.e., exemplar-invariance, camera-invariance and neighborhood-invariance. To achieve this goal, an exemplar memory is introduced to store features of the target domain and accommodate the three invariance properties. The memory allows us to enforce the invariance constraints over the global training batch without significantly increasing the computation cost. Experiments demonstrate that the three invariance properties and the proposed memory are indispensable for an effective domain adaptation system. Results on three re-ID domains show that our domain adaptation accuracy outperforms the state of the art by a large margin. Code is available at: https://github.com/zhunzhong07/ECN.
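The exemplar memory idea above is compact enough to sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' released code (the class name, the moving-average momentum and the temperature are assumptions): one slot is kept per target-domain image, a batch of features is scored against all slots as a non-parametric classifier, and the slots seen in a batch are refreshed by a moving average.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of an exemplar memory for domain adaptive re-ID.
class ExemplarMemory:
    def __init__(self, num_samples, feat_dim, momentum=0.5, tau=0.05):
        # One L2-normalised slot per unlabeled target-domain image.
        self.bank = F.normalize(torch.randn(num_samples, feat_dim), dim=1)
        self.momentum = momentum  # moving-average rate for slot updates
        self.tau = tau            # softmax temperature

    def scores(self, feats):
        # Non-parametric classification of a batch against every exemplar.
        return feats @ self.bank.t() / self.tau

    @torch.no_grad()
    def update(self, feats, indices):
        # Exponential moving-average refresh of the slots seen in this batch.
        new = self.momentum * self.bank[indices] + (1 - self.momentum) * feats
        self.bank[indices] = F.normalize(new, dim=1)

memory = ExemplarMemory(num_samples=10000, feat_dim=2048)  # sizes illustrative
feats = F.normalize(torch.randn(32, 2048), dim=1)
idx = torch.randint(0, 10000, (32,))
loss = F.cross_entropy(memory.scores(feats), idx)  # exemplar-invariance term
memory.update(feats, idx)
```

Camera- and neighborhood-invariance terms would presumably reuse the same scores with different targets, which is what keeps the memory cheap: one matrix product can serve all three constraints.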
Copyright © TRECVID 2018. All rights reserved. This work describes our approach to the fully automatic Ad-hoc Video Search (AVS) task of TRECVID 2018. Our model is divided into two parts, a visual model and a language model. Our motivation is to map video embeddings and language embeddings into the same semantic space. We observe that by constructing triplets in the feature space we can take better advantage of large batches and hard examples. Our models are trained on the MSR-VTT and TGIF datasets with different visual and language architectures. The final ensemble model achieves 6.7% mAP.
Zhu, F, Zhu, L & Yang, Y 2019, 'Sim-Real Joint Reinforcement Transfer for 3D Indoor Navigation', IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, CA, USA, pp. 11388-11397.View/Download from: Publisher's site
There has been an increasing interest in 3D indoor navigation, where a robot in an environment moves to a target according to an instruction. To deploy a robot for navigation in the physical world, a lot of training data is required to learn an effective policy. It is quite labour-intensive to obtain sufficient real-environment data for training robots, whereas synthetic data is much easier to construct by rendering. Though it is promising to utilize synthetic environments to facilitate navigation training in the real world, real environments are heterogeneous from synthetic environments in two aspects. First, the visual representations of the two environments have significant variances. Second, the houseplans of the two environments are quite different. Therefore, two types of information, i.e., visual representation and policy behavior, need to be adapted in the reinforcement model. The learning procedures of visual representation and of policy behavior are presumably reciprocal. We propose to jointly adapt visual representation and policy behavior to leverage the mutual impacts of environment and policy. Specifically, our method employs an adversarial feature adaptation model for visual representation transfer and a policy mimic strategy for policy behavior imitation. Experiments show that our method outperforms the baseline by 19.47% without any additional human annotations.
Zhu, M, Pan, P, Chen, W & Yang, Y 2020, 'EEMEFN: Low-Light Image Enhancement via Edge-Enhanced Multi-Exposure Fusion Network', New York.
Dong, X & Yang, Y 2019, 'Network Pruning via Transformable Architecture Search', Vancouver.
Dong, X & Yang, Y 2019, 'Searching for a robust neural architecture in four GPU hours', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1761-1770.View/Download from: Publisher's site
© 2019 IEEE. Conventional neural architecture search (NAS) approaches are usually based on reinforcement learning or evolutionary strategy, which take more than 1000 GPU hours to find a good model on CIFAR-10. We propose an efficient NAS approach, which learns the searching approach by gradient descent. Our approach represents the search space as a directed acyclic graph (DAG). This DAG contains thousands of sub-graphs, each of which indicates a kind of neural architecture. To avoid traversing all the possibilities of the sub-graphs, we develop a differentiable sampler over the DAG. This sampler is learnable and optimized by the validation loss after training the sampled architecture. In this way, our approach can be trained in an end-to-end fashion by gradient descent, named Gradient-based search using Differentiable Architecture Sampler (GDAS). In experiments, we can finish one searching procedure in four GPU hours on CIFAR-10, and the discovered model obtains a test error of 2.82% with only 2.5M parameters, which is on par with the state-of-the-art.
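The differentiable sampler admits a small self-contained sketch. Below is a hedged PyTorch illustration of the core trick, not the published GDAS code: each edge of the DAG holds learnable logits over candidate operations, one operation is sampled per forward pass with a hard Gumbel-softmax, and the straight-through gradients let the validation loss update the logits. The candidate set, the `MixedEdge` name and the temperature are assumptions; the real method also avoids computing the unsampled operations, which this sketch does not.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Candidate operations on an edge; the actual search space is richer.
CANDIDATES = [
    lambda c: nn.Conv2d(c, c, 3, padding=1),
    lambda c: nn.MaxPool2d(3, stride=1, padding=1),
    lambda c: nn.Identity(),
]

class MixedEdge(nn.Module):
    def __init__(self, channels, tau=10.0):
        super().__init__()
        self.ops = nn.ModuleList([op(channels) for op in CANDIDATES])
        self.logits = nn.Parameter(torch.zeros(len(self.ops)))
        self.tau = tau  # typically annealed towards a small value

    def forward(self, x):
        # hard=True draws a one-hot architecture sample; straight-through
        # estimation still routes gradients back into self.logits.
        w = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

edge = MixedEdge(channels=16)
out = edge(torch.randn(2, 16, 8, 8))  # only the sampled op shapes the output
```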
Fan, H, Zhu, L & Yang, Y 2019, 'Cubic LSTMs for Video Prediction', The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI, AAAI Conference on Artificial Intelligence, AAAI Press, Honolulu, Hawaii, USA, pp. 8263-8270.View/Download from: Publisher's site
Predicting future frames in videos has become a promising direction of research for both computer vision and robot learning communities. The core of this problem involves moving object capture and future motion prediction. While object capture specifies which objects are moving in videos, motion prediction describes their future dynamics. Motivated by this analysis, we propose a Cubic Long Short-Term Memory (CubicLSTM) unit for video prediction. CubicLSTM consists of three branches, i.e., a spatial branch for capturing moving objects, a temporal branch for processing motions, and an output branch for combining the first two branches to generate predicted frames. Stacking multiple CubicLSTM units along the spatial branch and output branch, and then evolving along the temporal branch can form a cubic recurrent neural network (CubicRNN). Experiment shows that CubicRNN produces more accurate video predictions than prior methods on both synthetic and real-world datasets.
He, Y, Liu, P, Wang, Z, Hu, Z & Yang, Y 2019, 'Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration', IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, CA, USA.View/Download from: Publisher's site
Previous works utilized the "smaller-norm-less-important" criterion to prune filters with smaller norm values in a convolutional neural network. In this paper, we analyze this norm-based criterion and point out that its effectiveness depends on two requirements that are not always met: (1) the norm deviation of the filters should be large; (2) the minimum norm of the filters should be small. To solve this problem, we propose a novel filter pruning method, namely Filter Pruning via Geometric Median (FPGM), to compress the model regardless of those two requirements. Unlike previous methods, FPGM compresses CNN models by pruning filters with redundancy, rather than those with "relatively less" importance. When applied to two image classification benchmarks, our method validates its usefulness and strengths. Notably, on CIFAR-10, FPGM reduces more than 52% FLOPs on ResNet-110 with even a 2.69% relative accuracy improvement. Moreover, on ILSVRC-2012, FPGM reduces more than 42% FLOPs on ResNet-101 without any top-5 accuracy drop, which has advanced the state-of-the-art. Code is publicly available on GitHub: https://github.com/he-y/filter-pruning-geometric-median
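The pruning criterion itself fits in a few lines. This is a hedged sketch of the selection rule as described in the abstract, with the common approximation that the filter minimising the summed distance to all other filters in its layer stands in for the geometric median; the function name and the ratio are illustrative.

```python
import torch

def fpgm_prune_indices(weight: torch.Tensor, prune_ratio: float):
    """weight: conv kernel of shape (out_channels, in_channels, k, k)."""
    flat = weight.flatten(start_dim=1)        # one row per filter
    dist = torch.cdist(flat, flat, p=2)       # pairwise L2 distances
    score = dist.sum(dim=1)                   # total distance to all others
    n_prune = int(prune_ratio * flat.size(0))
    # Filters closest to the geometric median are the most redundant ones.
    return torch.argsort(score)[:n_prune]

w = torch.randn(64, 32, 3, 3)
print(fpgm_prune_indices(w, prune_ratio=0.3))  # indices of 19 redundant filters
```

Note the contrast with norm-based criteria: a filter can have a large norm and still be redundant if several near-duplicates of it exist in the same layer.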
Hu, J, Yan, C, Liu, X, Zhang, J, Peng, D & Yang, Y 2019, 'Truncated gradient confidence-weighted based online learning for imbalance streaming data', Proceedings - IEEE International Conference on Multimedia and Expo, IEEE International Conference on Multimedia and Expo, IEEE, Shanghai, China, pp. 133-138.View/Download from: Publisher's site
© 2019 IEEE. Online learning for imbalanced streaming data is an important and challenging problem for many classification tasks in the machine learning research field. Traditional online learning algorithms mainly focus on classification tasks with balanced data, with little consideration of the characteristics of imbalanced streaming data. In this paper, we propose a novel online learning algorithm called Truncated Gradient Confidence-Weighted (TGCW), which integrates the truncated gradient algorithm with the confidence-weighted algorithm to improve the feature selection ability while effectively reducing the dimensionality of imbalanced streaming data. We study a number of classification tasks with various imbalance ratios, including a pedestrian detection application, and compare the performance of the TGCW algorithm with traditional online learning algorithms; empirical results show that the TGCW algorithm consistently achieves better performance than other baseline approaches.
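For intuition, the sparsity-inducing half of the combination can be sketched on its own. The fragment below shows the classical truncated-gradient shrinkage operator (parameter names `gravity` and `theta` are assumptions); the full TGCW update additionally maintains confidence-weighted second-order statistics and handles class imbalance, which are omitted here.

```python
import numpy as np

def truncate(w, gravity, theta):
    # Shrink coefficients inside [-theta, theta] towards zero; entries that
    # reach exactly zero correspond to de-selected features.
    out = w.copy()
    small = np.abs(w) <= theta
    out[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - gravity, 0.0)
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=10)
w = truncate(w - 0.1 * rng.normal(size=10), gravity=0.5, theta=0.8)
print(np.count_nonzero(w), "features remain selected")
```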
Kang, G, Jiang, L, Yang, Y & Hauptmann, AG 2019, 'Contrastive adaptation network for unsupervised domain adaptation', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Long Beach, CA, USA, pp. 4888-4897.View/Download from: Publisher's site
© 2019 IEEE. Unsupervised Domain Adaptation (UDA) makes predictions for the target domain data while manual annotations are only available in the source domain. Previous methods minimize the domain discrepancy neglecting the class information, which may lead to misalignment and poor generalization performance. To address this issue, this paper proposes Contrastive Adaptation Network (CAN) optimizing a new metric which explicitly models the intra-class domain discrepancy and the inter-class domain discrepancy. We design an alternating update strategy for training CAN in an end-to-end manner. Experiments on two real-world benchmarks Office-31 and VisDA-2017 demonstrate that CAN performs favorably against the state-of-the-art methods and produces more discriminative features.
Lin, Y, Dong, X, Zheng, L, Yan, Y & Yang, Y 2019, 'A Bottom-Up Clustering Approach to Unsupervised Person Re-Identification', Proceedings of the AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Honolulu, HI, pp. 8738-8745.View/Download from: Publisher's site
Most person re-identification (re-ID) approaches are based on supervised learning, which requires intensive manual annotation of training data. However, it is not only resource-intensive to acquire identity annotation but also impractical to label large-scale real-world data. To relieve this problem, we propose a bottom-up clustering (BUC) approach to jointly optimize a convolutional neural network (CNN) and the relationships among the individual samples. Our algorithm considers two fundamental facts of the re-ID task, i.e., diversity across different identities and similarity within the same identity. Specifically, our algorithm starts by regarding each individual sample as a different identity, which maximizes the diversity over the identities. It then gradually groups similar samples into one identity, which increases the similarity within each identity. We utilize a diversity regularization term in the bottom-up clustering procedure to balance the data volume of each cluster. Finally, the model achieves an effective trade-off between diversity and similarity. We conduct extensive experiments on large-scale image and video re-ID datasets, including Market-1501, DukeMTMC-reID, MARS and DukeMTMC-VideoReID. The experimental results demonstrate that our algorithm is not only superior to state-of-the-art unsupervised re-ID approaches, but also performs favorably against competing transfer learning and semi-supervised learning methods.
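The merging loop with its diversity regulariser can be made concrete with a toy sketch. The version below uses centroid distances and a linear cluster-size penalty as the merging cost; both the criterion details and `lam` are illustrative assumptions, and the real method alternates such merging with CNN fine-tuning.

```python
import numpy as np

def bottom_up_cluster(feats, n_clusters, lam=0.1):
    # Start from singleton clusters (maximal diversity), then repeatedly
    # merge the cheapest pair: feature distance plus a size penalty that
    # discourages a few clusters from swallowing most samples.
    clusters = [[i] for i in range(len(feats))]
    while len(clusters) > n_clusters:
        best, best_cost = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(feats[clusters[a]].mean(0)
                                   - feats[clusters[b]].mean(0))
                cost = d + lam * (len(clusters[a]) + len(clusters[b]))
                if cost < best_cost:
                    best, best_cost = (a, b), cost
        a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

feats = np.random.rand(40, 8)
print([len(c) for c in bottom_up_cluster(feats, n_clusters=5)])
```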
Liu, Y, Lee, J, Park, M, Kim, S, Yang, E, Hwang, SJ & Yang, Y 2019, 'Learning to propagate labels: Transductive propagation network for few-shot learning', 7th International Conference on Learning Representations, ICLR 2019, International Conference on Learning Representations, OpenReview, New Orleans, Louisiana, United States, pp. 1-14.
© 7th International Conference on Learning Representations, ICLR 2019. All Rights Reserved. The goal of few-shot learning is to learn a classifier that generalizes well even when trained with a limited number of training instances per class. The recently introduced meta-learning approaches tackle this problem by learning a generic classifier across a large number of multiclass classification tasks and generalizing the model to a new task. Yet, even with such meta-learning, the low-data problem in the novel classification task still remains. In this paper, we propose Transductive Propagation Network (TPN), a novel meta-learning framework for transductive inference that classifies the entire test set at once to alleviate the low-data problem. Specifically, we propose to learn to propagate labels from labeled instances to unlabeled test instances, by learning a graph construction module that exploits the manifold structure in the data. TPN jointly learns both the parameters of feature embedding and the graph construction in an end-to-end manner. We validate TPN on multiple benchmark datasets, on which it largely outperforms existing few-shot learning approaches and achieves the state-of-the-art results.
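What TPN learns is the graph construction; the propagation step on top of it has a classical closed form, sketched below in NumPy under standard assumptions (a Gaussian affinity with fixed sigma and a fixed alpha, whereas the paper instead learns the graph end-to-end).

```python
import numpy as np

def propagate(X, Y, alpha=0.99, sigma=1.0):
    # Labels diffuse over a symmetrically-normalised affinity graph:
    # F* = (I - alpha * S)^(-1) Y, as in classical label propagation.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                    # no self-loops
    D = np.diag(1.0 / np.sqrt(W.sum(1)))
    S = D @ W @ D
    return np.linalg.solve(np.eye(len(X)) - alpha * S, Y)

X = np.random.rand(10, 5)                       # support + query embeddings
Y = np.zeros((10, 3)); Y[0, 0] = Y[1, 1] = Y[2, 2] = 1  # 3 labeled rows
print(propagate(X, Y).argmax(1))                # predictions for every row
```

Classifying the whole query set jointly through this diffusion is what makes the inference transductive rather than per-sample.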
Liu, Y, Yan, Y, Chen, L, Han, Y & Yang, Y 2019, 'Adaptive Sparse Confidence-Weighted Learning for Online Feature Selection', Proceedings of the AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Honolulu, Hawaii, USA, pp. 4408-4415.View/Download from: Publisher's site
In this paper, we propose a new online feature selection algorithm for streaming data. We focus on the following two problems which remain unaddressed in the literature. First, most existing online feature selection algorithms merely utilize the first-order information of the data streams, despite the fact that second-order information explores the correlations between features and significantly improves performance. Second, most online feature selection algorithms are based on the balanced-data presumption, which does not hold in many real-world applications. For example, in fraud detection, the number of positive examples is much smaller than that of negative examples because most cases are not fraudulent. The balanced assumption will make the selected features biased towards the majority class and fail to detect the fraud cases. We propose an Adaptive Sparse Confidence-Weighted (ASCW) algorithm to solve the aforementioned two problems. We first introduce an ℓ0-norm constraint into second-order confidence-weighted (CW) learning for feature selection. Then the original loss is substituted with a cost-sensitive loss function to address the imbalanced-data issue. Furthermore, our algorithm maintains multiple sparse CW learners with corresponding cost vectors to dynamically select an optimal cost. We theoretically extend the theory of sparse CW learning and analyze its performance behavior in terms of F-measure. Empirical studies show superior performance over state-of-the-art online learning methods in the online-batch setting.
Wu, A, Han, Y & Yang, Y 2019, 'Video interactive captioning with human prompts', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Macao, pp. 961-967.View/Download from: Publisher's site
© 2019 International Joint Conferences on Artificial Intelligence. All rights reserved. Video captioning aims at generating a proper sentence to describe the video content. As a video often includes rich visual content and semantic details, different people may be interested in different views. Thus, the generated sentence always fails to meet ad hoc expectations. In this paper, we make a new attempt: we launch a round of interaction between a human and a captioning agent. After generating an initial caption, the agent asks for a short prompt from the human as a clue to their expectation. Then, based on the prompt, the agent can generate a more accurate caption. We name this process a new task of video interactive captioning (ViCap). Taking a video and an initial caption as input, we devise the ViCap agent, which consists of a video encoder, an initial caption encoder, and a refined caption generator. We show that ViCap can be trained in a fully supervised (with ground truth) way or a weakly supervised (with only prompts) way. For the evaluation of ViCap, we first extend MSR-VTT with interaction ground truth. Experimental results not only show that the prompts can help generate more accurate captions, but also demonstrate the good performance of the proposed method.
Zhang, H, Zhou, P, Yang, Y & Feng, J 2019, 'Generalized majorization-minimization for non-convex optimization', Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, Macao, pp. 4257-4263.View/Download from: Publisher's site
© 2019 International Joint Conferences on Artificial Intelligence. All rights reserved. Majorization-Minimization (MM) algorithms optimize an objective function by iteratively minimizing its majorizing surrogate and offer an attractively fast convergence rate for convex problems. However, their convergence behavior for non-convex problems remains unclear. In this paper, we propose a novel MM surrogate function, generalized from strictly upper-bounding the objective to bounding the objective in expectation. With this generalized surrogate conception, we develop a new optimization algorithm, termed SPI-MM, that leverages the recently proposed SPIDER for more efficient non-convex optimization. We prove that for finite-sum problems, the SPI-MM algorithm converges to a stationary point with deterministic and lower stochastic gradient complexity. To the best of our knowledge, this work gives the first non-asymptotic convergence analysis for MM-alike algorithms in general non-convex optimization. Extensive empirical studies on non-convex logistic regression and sparse PCA demonstrate the advantageous efficiency of the proposed algorithm and validate our theoretical results.
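For readers less familiar with MM, the classical conditions the paper relaxes are worth writing out; the display below is a standard textbook formulation, not copied from the paper.

```latex
% Classical majorize-minimize: the surrogate g upper-bounds f and is tight
% at the current iterate, which guarantees monotone descent.
\begin{align*}
  g(x \mid x_k) &\ge f(x) \quad \text{for all } x, \\
  g(x_k \mid x_k) &= f(x_k), \\
  x_{k+1} &= \operatorname*{arg\,min}_x \, g(x \mid x_k),
\end{align*}
% hence f(x_{k+1}) \le g(x_{k+1} \mid x_k) \le g(x_k \mid x_k) = f(x_k).
```

SPI-MM's generalization replaces the first inequality with an upper bound that holds only in expectation, which is what lets stochastic estimators such as SPIDER drive the surrogate.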
Zheng, Z, Yang, X, Yu, Z, Zheng, L, Yang, Y & Kautz, J 2019, 'Joint discriminative and generative learning for person re-identification', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2133-2142.View/Download from: Publisher's site
© 2019 IEEE. Person re-identification (re-id) remains challenging due to significant intra-class variations across different cameras. Recently, there has been a growing interest in using generative models to augment training data and enhance the invariance to input changes. The generative pipelines in existing methods, however, stay relatively separate from the discriminative re-id learning stages. Accordingly, re-id models are often trained in a straightforward manner on the generated data. In this paper, we seek to improve learned re-id embeddings by better leveraging the generated data. To this end, we propose a joint learning framework that couples re-id learning and data generation end-to-end. Our model involves a generative module that separately encodes each person into an appearance code and a structure code, and a discriminative module that shares the appearance encoder with the generative module. By switching the appearance or structure codes, the generative module is able to generate high-quality cross-id composed images, which are online fed back to the appearance encoder and used to improve the discriminative module. The proposed joint learning framework renders significant improvement over the baseline without using generated data, leading to the state-of-the-art performance on several benchmark datasets.
Zhu, M, Pan, P, Chen, W & Yang, Y 2019, 'DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 5795-5803.View/Download from: Publisher's site
© 2019 IEEE. In this paper, we focus on generating realistic images from text descriptions. Current methods first generate an initial image with rough shape and color, and then refine the initial image into a high-resolution one. Most existing text-to-image synthesis methods have two main problems. (1) These methods depend heavily on the quality of the initial images. If the initial image is not well initialized, the following processes can hardly refine the image to a satisfactory quality. (2) Each word carries a different level of importance when depicting different image contents; however, an unchanged text representation is used in existing image refinement processes. In this paper, we propose the Dynamic Memory Generative Adversarial Network (DM-GAN) to generate high-quality images. The proposed method introduces a dynamic memory module to refine fuzzy image contents when the initial images are not well generated. A memory writing gate is designed to select the important text information based on the initial image content, which enables our method to accurately generate images from the text description. We also utilize a response gate to adaptively fuse the information read from the memories and the image features. We evaluate the DM-GAN model on the Caltech-UCSD Birds 200 dataset and the Microsoft Common Objects in Context dataset. Experimental results demonstrate that our DM-GAN model performs favorably against the state-of-the-art approaches.
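The response gate can be illustrated with a short hedged sketch; the layer shapes and names below are assumptions for demonstration, not the released DM-GAN code.

```python
import torch
import torch.nn as nn

class ResponseGate(nn.Module):
    # Adaptively fuse the memory read-out with the current image features:
    # a learned sigmoid gate decides, per dimension, which source to trust.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, image_feat, memory_readout):
        g = torch.sigmoid(self.gate(torch.cat([image_feat, memory_readout], -1)))
        return g * memory_readout + (1 - g) * image_feat

fuse = ResponseGate(dim=64)
img, mem = torch.randn(4, 64), torch.randn(4, 64)
print(fuse(img, mem).shape)  # torch.Size([4, 64])
```

The memory writing gate described in the abstract has the same flavour, weighting word features by the initial image content before they are written into memory.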
Chang, X, Huang, PY, Shen, YD, Liang, X, Yang, Y & Hauptmann, AG 2018, 'RCAA: Relational context-aware agents for person search', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), European Conference on Computer Vision, Springer, Munich, Germany, pp. 86-102.View/Download from: Publisher's site
© Springer Nature Switzerland AG 2018. We aim to search for a target person from a gallery of whole scene images for which the annotations of pedestrian bounding boxes are unavailable. Previous approaches to this problem have relied on a pedestrian proposal net, which may generate redundant proposals and increase the computational burden. In this paper, we address this problem by training relational context-aware agents which learn the actions to localize the target person from the gallery of whole scene images. We incorporate the relational spatial and temporal contexts into the framework. Specifically, we propose to use the target person as the query in the query-dependent relational network. The agent determines the best action to take at each time step by simultaneously considering the local visual information, the relational and temporal contexts, together with the target person. To validate the performance of our approach, we conduct extensive experiments on the large-scale Person Search benchmark dataset and achieve significant improvements over the compared approaches. It is also worth noting that the proposed model even performs better than traditional methods with perfect pedestrian detectors.
Deng, W, Zheng, L, Ye, Q, Kang, G, Yang, Y & Jiao, J 2018, 'Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification', 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, pp. 994-1003.View/Download from: Publisher's site
Dong, X, Yan, Y, Ouyang, W & Yang, Y 2018, 'Style Aggregated Network for Facial Landmark Detection', 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, USA.View/Download from: Publisher's site
Dong, X, Yu, SI, Weng, X, Wei, SE, Yang, Y & Sheikh, Y 2018, 'Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, USA, pp. 360-368.View/Download from: Publisher's site
© 2018 IEEE. In this paper, we present supervision-by-registration, an unsupervised approach to improve the precision of facial landmark detectors on both images and video. Our key observation is that the detections of the same landmark in adjacent frames should be coherent with registration, i.e., optical flow. Interestingly, coherency of optical flow is a source of supervision that does not require manual labeling and can be leveraged during detector training. For example, we can enforce in the training loss function that a landmark detected at frame t-1, followed by optical-flow tracking from frame t-1 to frame t, should coincide with the location of the detection at frame t. Essentially, supervision-by-registration augments the training loss function with a registration loss, thus training the detector to produce output that is not only close to the annotations in labeled images, but also consistent with registration on large amounts of unlabeled video. End-to-end training with the registration loss is made possible by a differentiable Lucas-Kanade operation, which computes optical-flow registration in the forward pass and back-propagates gradients that encourage temporal coherency in the detector. The output of our method is a more precise image-based facial landmark detector, which can be applied to single images or video. With supervision-by-registration, we demonstrate (1) improvements in facial landmark detection on both images (300W, AFLW) and video (300VW, Youtube-Celebrities), and (2) a significant reduction of jittering in video detections.
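The registration loss reduces to a simple consistency term once a differentiable tracker is available. The sketch below is a toy illustration: `flow_track` stands in for the differentiable Lucas-Kanade operation and is an assumption here, as is the squared-error form.

```python
import torch

def registration_loss(pred_t, pred_tm1, flow_track):
    # A landmark detected at frame t should coincide with the frame t-1
    # detection carried forward by optical flow.
    tracked = flow_track(pred_tm1)
    return ((pred_t - tracked) ** 2).sum(-1).mean()

pred_tm1 = torch.rand(68, 2, requires_grad=True)  # 68 facial landmarks
pred_t = torch.rand(68, 2, requires_grad=True)
flow_track = lambda p: p + 0.01                   # placeholder tracker
loss = registration_loss(pred_t, pred_tm1, flow_track)
loss.backward()  # gradients reach the detections in both frames
```

Because the loss needs no annotations, it can be accumulated over arbitrary amounts of unlabeled video alongside the usual supervised detection loss.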
Dong, X, Zhu, L, Zhang, D, Yang, Y & Wu, F 2018, 'Fast Parameter Adaptation for Few-shot Image Captioning and Visual Question Answering', ACM Multimedia Conference (MM), pp. 54-62.View/Download from: Publisher's site
Fan, H, Xu, Z, Zhu, L, Yan, C, Ge, J & Yang, Y 2018, 'Watching a Small Portion could be as Good as Watching All: Towards Efficient Video Classification', Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI, pp. 705-711.View/Download from: Publisher's site
He, Y, Kang, G, Dong, X, Fu, Y & Yang, Y 2018, 'Soft filter pruning for accelerating deep convolutional neural networks', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pp. 2234-2240.
© 2018 International Joint Conferences on Artificial Intelligence. All rights reserved. This paper proposes a Soft Filter Pruning (SFP) method to accelerate the inference procedure of deep Convolutional Neural Networks (CNNs). Specifically, the proposed SFP enables the pruned filters to be updated when training the model after pruning. SFP has two advantages over previous works: (1) Larger model capacity. Updating previously pruned filters provides our approach with a larger optimization space than fixing the filters to zero. Therefore, the network trained by our method has a larger model capacity to learn from the training data. (2) Less dependence on the pre-trained model. The large capacity enables SFP to train from scratch and prune the model simultaneously. In contrast, previous filter pruning methods must be conducted on the basis of a pre-trained model to guarantee their performance. Empirically, SFP from scratch outperforms previous filter pruning methods. Moreover, our approach has been demonstrated to be effective for many advanced CNN architectures. Notably, on ILSVRC-2012, SFP reduces more than 42% FLOPs on ResNet-101 with even a 0.2% top-5 accuracy improvement, which has advanced the state-of-the-art.
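The 'soft' part of the method is the key difference from hard pruning and is easy to sketch. The following hedged PyTorch fragment zeroes the smallest-norm filters after an epoch but leaves them in the network so later gradients can revive them; the names and the placement of the L2 criterion are assumptions consistent with the abstract.

```python
import torch

@torch.no_grad()
def soft_prune_(weight: torch.Tensor, prune_ratio: float):
    # Zero (but do not remove) the filters with the smallest L2 norms;
    # they stay trainable and may regain importance in later epochs.
    norms = weight.flatten(1).norm(dim=1)          # one norm per filter
    n_prune = int(prune_ratio * weight.size(0))
    idx = torch.argsort(norms)[:n_prune]
    weight[idx] = 0.0

conv = torch.nn.Conv2d(32, 64, 3)
soft_prune_(conv.weight, prune_ratio=0.3)          # called once per epoch
print((conv.weight.flatten(1).norm(dim=1) == 0).sum().item(), "filters zeroed")
```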
Kang, G, Zheng, L, Yan, Y & Yang, Y 2018, 'Deep Adversarial Attention Alignment for Unsupervised Domain Adaptation: The Benefit of Target Expectation Maximization', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), European Conference on Computer Vision, Springer Link, Munich, Germany, pp. 420-436.View/Download from: Publisher's site
© 2018, Springer Nature Switzerland AG. In this paper, we make two contributions to unsupervised domain adaptation (UDA) using the convolutional neural network (CNN). First, our approach transfers knowledge in all the convolutional layers through attention alignment. Most previous methods align high-level representations, e.g., activations of the fully connected (FC) layers. In these methods, however, the convolutional layers which underpin critical low-level domain knowledge cannot be updated directly towards reducing domain discrepancy. Specifically, we assume that the discriminative regions in an image are relatively invariant to image style changes. Based on this assumption, we propose an attention alignment scheme on all the target convolutional layers to uncover the knowledge shared by the source domain. Second, we estimate the posterior label distribution of the unlabeled data for target network training. Previous methods, which iteratively update the pseudo labels by the target network and refine the target network by the updated pseudo labels, are vulnerable to label estimation errors. Instead, our approach uses category distribution to calculate the cross-entropy loss for training, thereby ameliorating the error accumulation of the estimated labels. The two contributions allow our approach to outperform the state-of-the-art methods by +2.6% on the Office-31 dataset.
Li, Z, Nie, F, Chang, X, Ma, Z & Yang, Y 2018, 'Balanced clustering via exclusive lasso: A pragmatic approach', 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, AAAI Conference on Artificial Intelligence, AAAI, New Orleans, USA, pp. 3596-3603.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Clustering is an effective technique in data mining to generate groups that are the matter of interest. Among various clustering approaches, the family of k-means algorithms and min-cut algorithms gain the most popularity due to their simplicity and efficacy. The classical k-means algorithm partitions a number of data points into several subsets by iteratively updating the clustering centers and the associated data points. By contrast, a weighted undirected graph is constructed in min-cut algorithms, which partition the vertices of the graph into two sets. However, existing clustering algorithms tend to cluster a minority of the data points into a subset, which should be avoided when the target dataset is balanced. To achieve more accurate clustering for balanced datasets, we propose to leverage the exclusive lasso on k-means and min-cut to regulate the balance degree of the clustering results. By optimizing our objective functions that build atop the exclusive lasso, we can make the clustering result as balanced as possible. Extensive experiments on several large-scale datasets validate the advantage of the proposed algorithms compared to the state-of-the-art clustering algorithms.
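To make the balance term concrete: for a hard (0/1) cluster-indicator matrix, the exclusive lasso reduces to the sum of squared cluster sizes, which is smallest when clusters are equally sized. A minimal NumPy sketch under that reading; all names are illustrative, not the paper's code:

```python
import numpy as np

def balanced_kmeans_objective(X, centers, labels, gamma):
    # Sketch of a k-means objective with an exclusive-lasso balance term.
    # For a 0/1 cluster-indicator matrix the exclusive lasso reduces to
    # the sum of squared cluster sizes; gamma trades fit against balance.
    fit = np.sum((X - centers[labels]) ** 2)
    sizes = np.bincount(labels, minlength=len(centers))
    balance = np.sum(sizes.astype(float) ** 2)  # exclusive lasso on indicators
    return fit + gamma * balance

# Toy usage: a balanced assignment incurs a lower balance penalty (8 vs 10).
X = np.array([[0.0], [0.1], [5.0], [5.1]])
centers = np.array([[0.05], [5.05]])
print(balanced_kmeans_objective(X, centers, np.array([0, 0, 1, 1]), gamma=0.1))
print(balanced_kmeans_objective(X, centers, np.array([0, 0, 0, 1]), gamma=0.1))
```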
Liu, W, Chang, X, Chen, L & Yang, Y 2018, 'Semi-supervised Bayesian attribute learning for person re-identification', Thirty-Second AAAI Conference on Artificial Intelligence, AAAI, New Orleans, Louisiana, USA.
Person re-identification (re-ID) tasks aim to identify the same person in multiple images captured from non-overlapping camera views. Most previous re-ID studies have attempted to solve this problem through either representation learning or metric learning, or by combining both techniques. Representation learning relies on the latent factors or attributes of the data. In most of these works, the dimensionality of the factors/attributes has to be manually determined for each new dataset; thus, this approach is not robust. Metric learning optimizes a metric across the dataset to measure similarity according to distance. However, choosing the optimal method for computing these distances is data dependent, and learning the appropriate metric relies on a sufficient number of pair-wise labels. To overcome these limitations, we propose a novel algorithm for person re-ID, called semi-supervised Bayesian attribute learning. We introduce an Indian Buffet Process to identify the priors of the latent attributes. The dimensionality of the attribute factors is then automatically determined by nonparametric Bayesian learning. Meanwhile, unlike traditional distance metric learning, we propose a re-identification probability distribution to describe how likely it is that a pair of images contains the same person. This technique relies solely on the latent attributes of both images. Moreover, pair-wise labels that are not known can be estimated from pair-wise labels that are known, making this a robust approach for semi-supervised learning. Extensive experiments demonstrate the superior performance of our algorithm over several state-of-the-art algorithms on small-scale datasets and comparable performance on large-scale re-ID datasets.
Luo, Y, Zheng, Z, Zheng, L, Guan, T, Yu, J & Yang, Y 2018, 'Macro-micro adversarial network for human parsing', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), European Conference on Computer Vision 2018, Munich, Germany, pp. 424-440.View/Download from: Publisher's site
© Springer Nature Switzerland AG 2018. In human parsing, the pixel-wise classification loss has drawbacks in its low-level local inconsistency and high-level semantic inconsistency. The introduction of the adversarial network tackles the two problems using a single discriminator. However, the two types of parsing inconsistency are generated by distinct mechanisms, so it is difficult for a single discriminator to solve them both. To address the two kinds of inconsistencies, this paper proposes the Macro-Micro Adversarial Net (MMAN). It has two discriminators. One discriminator, Macro D, acts on the low-resolution label map and penalizes semantic inconsistency, e.g., misplaced body parts. The other discriminator, Micro D, focuses on multiple patches of the high-resolution label map to address the local inconsistency, e.g., blur and hole. Compared with traditional adversarial networks, MMAN not only enforces local and semantic consistency explicitly, but also avoids the poor convergence problem of adversarial networks when handling high resolution images. In our experiment, we validate that the two discriminators are complementary to each other in improving the human parsing accuracy. The proposed framework is capable of producing competitive parsing performance compared with the state-of-the-art methods, i.e., mIoU = 46.81% and 59.91% on LIP and PASCAL-Person-Part, respectively. On a relatively small dataset PPSS, our pre-trained model demonstrates impressive generalization ability. The code is publicly available at https://github.com/RoyalVane/MMAN.
Sun, Y, Zheng, L, Yang, Y, Tian, Q & Wang, S 2018, 'Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline)', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), European Conference on Computer Vision, Springer, Munich, Germany, pp. 501-518.View/Download from: Publisher's site
© 2018, Springer Nature Switzerland AG. Employing part-level features offers fine-grained information for pedestrian image description. A prerequisite of part discovery is that each part should be well located. Instead of using external resources like a pose estimator, we consider content consistency within each part for precise part localization. Specifically, we focus on learning discriminative part-informed features for person retrieval and make two contributions. (i) A network named Part-based Convolutional Baseline (PCB). Given an image input, it outputs a convolutional descriptor consisting of several part-level features. With a uniform partition strategy, PCB achieves competitive results with the state-of-the-art methods, proving itself a strong convolutional baseline for person retrieval. (ii) A refined part pooling (RPP) method. Uniform partition inevitably incurs outliers in each part, which are in fact more similar to other parts. RPP re-assigns these outliers to the parts they are closest to, resulting in refined parts with enhanced within-part consistency. Experiments confirm that RPP allows PCB to gain another round of performance boost. For instance, on the Market-1501 dataset, we achieve (77.4+4.2)% mAP and (92.3+1.5)% rank-1 accuracy, surpassing the state of the art by a large margin. Code is available at: https://github.com/syfafterzy/PCB_RPP.
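A minimal NumPy sketch of the uniform partition step described above; names and tensor sizes are illustrative, and the refined part pooling stage is not included:

```python
import numpy as np

def uniform_part_pooling(feature_map, num_parts=6):
    # PCB-style uniform partition (a sketch): split the backbone's
    # activation tensor (C, H, W) into num_parts horizontal stripes
    # and average-pool each stripe into a column feature vector.
    stripes = np.array_split(feature_map, num_parts, axis=1)  # split along H
    return np.stack([s.mean(axis=(1, 2)) for s in stripes])   # (num_parts, C)

# Toy usage: a 2048 x 24 x 8 tensor yields six 2048-d part features.
feat = np.random.default_rng(2).normal(size=(2048, 24, 8))
parts = uniform_part_pooling(feat)
print(parts.shape)  # (6, 2048)
```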
Wang, H, Chang, X, Shi, L, Yang, Y & Shen, YD 2018, 'Uncertainty sampling for action recognition via maximizing expected average precision', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, ACM, Stockholm, Sweden, pp. 964-970.View/Download from: Publisher's site
© 2018 International Joint Conferences on Artificial Intelligence. All rights reserved. Recognizing human actions in video clips has been an important topic in computer vision. Sufficient labeled data is one of the prerequisites for the good performance of action recognition algorithms. However, while abundant videos can be collected from the Internet, categorizing each video clip is time-consuming. Active learning is one way to alleviate the labeling labor by allowing the classifier to choose the most informative unlabeled instances for manual annotation. Among various active learning algorithms, uncertainty sampling is arguably the most widely used strategy. Conventional uncertainty sampling strategies such as entropy-based methods are usually tested under accuracy. However, in action recognition, Average Precision (AP), defined as the area under the precision-recall curve, is an acknowledged evaluation metric that has been largely ignored in the active learning community. In this paper, we propose a novel uncertainty sampling algorithm for action recognition using expected AP. We conduct experiments on three real-world action recognition datasets and show that our algorithm outperforms other uncertainty-based active learning algorithms.
Wu, Y, Lin, Y, Dong, X, Yan, Y, Ouyang, W & Yang, Y 2018, 'Exploit the Unknown Gradually: One-Shot Video-Based Person Re-Identification by Stepwise Learning', 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, pp. 5177-5186.View/Download from: Publisher's site
Yan, Y, Yang, T, Li, Z, Lin, Q & Yang, Y 2018, 'A unified analysis of stochastic momentum methods for deep learning', Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, Stockholm, Sweden, pp. 2955-2961.View/Download from: Publisher's site
© 2018 International Joint Conferences on Artificial Intelligence. All rights reserved. Stochastic momentum methods have been widely adopted in training deep neural networks. However, their theoretical analysis of convergence of the training objective and the generalization error for prediction is still under-explored. This paper aims to bridge the gap between practice and theory by analyzing the stochastic gradient (SG) method, and the stochastic momentum methods including two famous variants, i.e., the stochastic heavy-ball (SHB) method and the stochastic variant of Nesterov's accelerated gradient (SNAG) method. We propose a framework that unifies the three variants. We then derive the convergence rates of the norm of the gradient for the non-convex optimization problem, and analyze the generalization performance through the uniform stability approach. In particular, the convergence analysis of the training objective shows that SHB and SNAG have no advantage over SG. However, the stability analysis shows that the momentum term can improve the stability of the learned model and hence improve the generalization performance. These theoretical insights verify the common wisdom and are also corroborated by our empirical analysis on deep learning.
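As a toy illustration of the three updates the paper unifies, here is a deterministic sketch on a one-dimensional quadratic; the paper's single-parameter unified formulation and its stochastic analysis are not reproduced:

```python
def grad(x):  # toy objective f(x) = 0.5 * x^2, so grad f(x) = x
    return x

def run(method, alpha=0.1, beta=0.9, steps=50):
    # Sketch of the three stochastic-momentum variants the paper
    # unifies, run on a deterministic toy quadratic for clarity.
    x, x_prev = 5.0, 5.0
    for _ in range(steps):
        if method == "SG":       # plain (stochastic) gradient step
            x_new = x - alpha * grad(x)
        elif method == "SHB":    # stochastic heavy ball
            x_new = x - alpha * grad(x) + beta * (x - x_prev)
        elif method == "SNAG":   # Nesterov: gradient at the look-ahead point
            y = x + beta * (x - x_prev)
            x_new = y - alpha * grad(y)
        x_prev, x = x, x_new
    return x

for m in ("SG", "SHB", "SNAG"):
    print(m, run(m))  # all three shrink the iterate toward 0
```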
Zhang, X, Wei, Y, Feng, J, Yang, Y & Huang, T 2018, 'Adversarial Complementary Learning for Weakly Supervised Object Localization', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, USA, pp. 1325-1334.View/Download from: Publisher's site
© 2018 IEEE. In this work, we propose Adversarial Complementary Learning (ACoL) to automatically localize integral objects of semantic interest with weak supervision. We first mathematically prove that class localization maps can be obtained by directly selecting the class-specific feature maps of the last convolutional layer, which paves a simple way to identify object regions. We then present a simple network architecture including two parallel classifiers for object localization. Specifically, we leverage one classification branch to dynamically localize some discriminative object regions during the forward pass. Although it is usually responsive to sparse parts of the target objects, this classifier can drive the counterpart classifier to discover new and complementary object regions by erasing its discovered regions from the feature maps. With such adversarial learning, the two parallel classifiers are forced to leverage complementary object regions for classification and can finally generate integral object localization together. The merits of ACoL are mainly two-fold: 1) it can be trained in an end-to-end manner; 2) dynamic erasing enables the counterpart classifier to discover complementary object regions more effectively. We demonstrate the superiority of our ACoL approach in a variety of experiments. In particular, the Top-1 localization error rate on the ILSVRC dataset is 45.14%, which is the new state-of-the-art.
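A minimal NumPy sketch of the erasing step described above; the threshold and shapes are illustrative assumptions:

```python
import numpy as np

def erase_discovered_regions(features, localization_map, threshold=0.6):
    # ACoL-style erasing (a sketch): zero out the feature-map positions
    # that the first classifier already finds discriminative, forcing
    # the second classifier to look for complementary regions.
    # features: (C, H, W); localization_map: (H, W) scaled to [0, 1].
    keep = (localization_map < threshold).astype(features.dtype)
    return features * keep  # mask broadcasts over channels

# Toy usage: erase the top-left quadrant the first branch "discovered".
feat = np.ones((4, 8, 8))
loc = np.zeros((8, 8))
loc[:4, :4] = 1.0
erased = erase_discovered_regions(feat, loc)
print(erased[0, :4, :4].sum(), erased[0, 4:, 4:].sum())  # 0.0 16.0
```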
Zhang, X, Wei, Y, Kang, G, Yang, Y & Huang, T 2018, 'Self-produced guidance for weakly-supervised object localization', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), European Conference on Computer Vision, Springer, Munich, Germany, pp. 610-625.View/Download from: Publisher's site
© Springer Nature Switzerland AG 2018. Weakly supervised methods usually generate localization results based on attention maps produced by classification networks. However, the attention maps exhibit only the most discriminative parts of the object, which are small and sparse. We propose to generate Self-produced Guidance (SPG) masks which separate the foreground, i.e., the object of interest, from the background to provide the classification networks with spatial correlation information of pixels. A stagewise approach is proposed in which highly confident object regions within the attention maps are utilized to progressively learn the SPG masks. The masks are then used as an auxiliary pixel-level supervision to facilitate the training of the classification networks. Extensive experiments on ILSVRC demonstrate that SPG is effective in producing high-quality object localization maps. In particular, the proposed SPG achieves a Top-1 localization error rate of 43.83% on the ILSVRC validation set, which is a new state-of-the-art error rate.
Zheng, L, Zhao, Y, Wang, S, Wang, J, Yang, Y & Tian, Q 2018, 'On the large-scale transferability of convolutional neural networks', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Melbourne, VIC, Australia, pp. 27-39.View/Download from: Publisher's site
© Springer Nature Switzerland AG 2018. Given the overwhelming performance of the Convolutional Neural Network (CNN) in the computer vision and machine learning community, this paper aims at investigating the effective transfer of the CNN descriptors in generic and fine-grained classification at a large scale. Our contribution consists in providing some simple yet effective methods in constructing a competitive baseline recognition system. Comprehensively, we study two facts in CNN transfer. (1) We demonstrate the advantage of using images with a properly large size as input to CNN instead of the conventionally resized one. (2) We benchmark the performance of different CNN layers improved by average/max pooling on the feature maps. Our evaluation and observation confirm that the Conv5 descriptor yields very competitive accuracy under such a pooling strategy. Following these good practices, we are capable of producing improved performance on seven image classification benchmarks.
Zhong, Z, Zheng, L, Li, S & Yang, Y 2018, 'Generalizing a person retrieval model hetero- and homogeneously', Computer Vision – ECCV 2018 (LNCS 11217), European Conference on Computer Vision, Springer, Germany, pp. 176-192.View/Download from: Publisher's site
© Springer Nature Switzerland AG 2018. Person re-identification (re-ID) poses unique challenges for unsupervised domain adaptation (UDA) in that classes in the source and target sets (domains) are entirely different and that image variations are largely caused by cameras. Given a labeled source training set and an unlabeled target training set, we aim to improve the generalization ability of re-ID models on the target testing set. To this end, we introduce a Hetero-Homogeneous Learning (HHL) method. Our method enforces two properties simultaneously: (1) camera invariance, learned via positive pairs formed by unlabeled target images and their camera style transferred counterparts; (2) domain connectedness, by regarding source/target images as negative matching pairs to the target/source images. The first property is implemented by homogeneous learning because training pairs are collected from the same domain. The second property is achieved by heterogeneous learning because we sample training pairs from both the source and target domains. On Market-1501, DukeMTMC-reID and CUHK03, we show that the two properties contribute indispensably and that very competitive re-ID UDA accuracy is achieved. Code is available at: https://github.com/zhunzhong07/HHL.
Zhong, Z, Zheng, L, Zheng, Z, Li, S & Yang, Y 2018, 'Camera Style Adaptation for Person Re-identification', Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, USA.View/Download from: Publisher's site
Chang, X, Yu, YL & Yang, Y 2017, 'Robust top-k multiclass SVM for visual category recognition', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Halifax, Nova Scotia, Canada, pp. 75-83.View/Download from: Publisher's site
© 2017 Association for Computing Machinery. Classification problems with a large number of classes inevitably involve overlapping or similar classes. In such cases it seems reasonable to allow the learning algorithm to make mistakes on similar classes, as long as the true class is still among the top-k (say) predictions. Likewise, in applications such as search engines or ad display, we are allowed to present k predictions at a time, and the customer will be satisfied as long as the prediction of interest is included. Inspired by recent work, we propose a very generic, robust multiclass SVM formulation that directly aims at minimizing a weighted and truncated combination of the ordered prediction scores. Our method includes many previous works as special cases. Computationally, using the Jordan decomposition lemma we show how to rewrite our objective as the difference of two convex functions, based on which we develop an efficient algorithm that allows incorporating many popular regularizers (such as the ℓ2 and ℓ1 norms). We conduct extensive experiments on four real large-scale visual category recognition datasets, and obtain very promising performance.
Dong, X, Huang, J, Yang, Y & Yan, S 2017, 'More is less: A more complicated network with less inference complexity', Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, HI, USA, pp. 1895-1903.View/Download from: Publisher's site
© 2017 IEEE. In this paper, we present a novel and general network structure towards accelerating the inference process of convolutional neural networks, which is more complicated in network structure yet with less inference complexity. The core idea is to equip each original convolutional layer with another low-cost collaborative layer (LCCL), and the element-wise multiplication of the ReLU outputs of these two parallel layers produces the layer-wise output. The combined layer is potentially more discriminative than the original convolutional layer, and its inference is faster for two reasons: 1) the zero cells of the LCCL feature maps will remain zero after element-wise multiplication, and thus it is safe to skip the calculation of the corresponding high-cost convolution in the original convolutional layer; 2) LCCL is very fast if it is implemented as a 1 × 1 convolution or only a single filter shared by all channels. Extensive experiments on the CIFAR-10, CIFAR-100 and ILSVRC-2012 benchmarks show that our proposed network structure can accelerate the inference process by 32% on average with negligible performance drop.
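A minimal NumPy sketch of the masking logic; it is computed densely here for clarity, so it exhibits the skip opportunity rather than an actual speedup, and both "convolutions" are stand-in callables:

```python
import numpy as np

def lccl_forward(x, conv_expensive, conv_cheap):
    # More-is-Less sketch: a low-cost collaborative layer (LCCL)
    # predicts, per output position, whether the expensive convolution
    # matters. After ReLU, zeros in the cheap map make the element-wise
    # product zero regardless of the expensive branch, so those
    # positions can be skipped at inference time.
    cheap = np.maximum(conv_cheap(x), 0.0)  # e.g. a 1x1 convolution
    mask = cheap > 0                        # positions worth computing
    out = np.zeros_like(cheap)
    out[mask] = np.maximum(conv_expensive(x), 0.0)[mask] * cheap[mask]
    return out, mask.mean()                 # fraction actually computed

# Toy usage with random linear "convolutions" on a (C, H, W) tensor.
rng = np.random.default_rng(3)
x = rng.normal(size=(4, 8, 8))
out, density = lccl_forward(x, lambda t: t * 2.0, lambda t: t - 0.5)
print(f"computed at {density:.0%} of positions")
```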
Fan, H, Chang, X, Cheng, D, Yang, Y, Xu, D & Hauptmann, AG 2017, 'Complex Event Detection by Identifying Reliable Shots from Untrimmed Videos', Proceedings of the IEEE International Conference on Computer Vision, International Conference on Computer Vision, IEEE, Venice, Italy, pp. 736-744.View/Download from: Publisher's site
© 2017 IEEE. The goal of complex event detection is to automatically detect whether an event of interest happens in temporally untrimmed long videos, which usually consist of multiple video shots. Observing that some video shots in positive (resp. negative) videos are irrelevant (resp. relevant) to the given event class, we formulate this task as a multi-instance learning (MIL) problem by taking each video as a bag and the video shots in each video as instances. To this end, we propose a new MIL method, which simultaneously learns a linear SVM classifier and infers a binary indicator for each instance in order to select reliable training instances from each positive or negative bag. In our new objective function, we balance the weighted training errors and an ℓ1-ℓ2 mixed-norm regularization term, which adaptively selects reliable shots from different videos as training instances so that they are as diverse as possible. We also develop an alternating optimization approach that can efficiently solve our proposed objective function. Extensive experiments on the challenging real-world Multimedia Event Detection (MED) datasets MEDTest-14, MEDTest-13 and CCV clearly demonstrate the effectiveness of our proposed MIL approach for complex event detection.
Li, G, Pan, P & Yang, Y 2017, 'UTS CAI submission at TRECVID 2017 video to text description task', 2017 TREC Video Retrieval Evaluation, TRECVID 2017.
Copyright © TRECVID 2017. All rights reserved. In this paper, we summarize our experiments related to the video-to-text description task of TRECVID 2017. The task consists of two subtasks, i.e., matching and ranking, and description generation. Our approach for description generation is based on three main phases: the extraction of high-level image features, the aggregation of multiple image features, and sentence generation based on a probabilistic language model. For every phase, we tried several state-of-the-art techniques and obtained the optimal combination according to the experimental results. In the matching and ranking task, we use the generated descriptions as the ground truth and rank the candidate descriptions by the similarity computed with two metrics: BLEU and METEOR.
Liu, W, Chang, X, Chen, L & Yang, Y 2017, 'Early Active Learning with Pairwise Constraint for Person Re-identification', ECML PKDD 2017: Machine Learning and Knowledge Discovery in Databases, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, Skopje, Macedonia, pp. 103-118.View/Download from: Publisher's site
Research on person re-identification (re-id) has attracted much attention in the machine learning field in recent years. With sufficient labeled training data, supervised re-id algorithms can obtain promising performance. However, producing labeled data for training supervised re-id models is an extremely challenging and time-consuming task, because it requires every pair of images across non-overlapping camera views to be labeled. Moreover, in the early stage of experiments, when labor resources are limited, only a small number of samples can be labeled. Thus, it is essential to design an effective algorithm to select the most representative samples. This is referred to as the early active learning or early-stage experimental design problem. The pairwise relationship plays a vital role in the re-id problem, but most of the existing early active learning algorithms fail to consider this relationship. To overcome this limitation, in this paper we propose a novel and efficient early active learning algorithm with a pairwise constraint for person re-identification. By introducing the pairwise constraint, the closeness of similar representations of instances is enforced in active learning. This benefits the performance of active learning for re-id. Extensive experimental results on four benchmark datasets confirm the superiority of the proposed algorithm.
Liu, Z, Wang, Z, Zhang, L, Shah, RR, Xia, Y, Yang, Y & Li, X 2017, 'FastShrinkage: Perceptually-aware retargeting toward mobile platforms', MM 2017 - Proceedings of the 2017 ACM Multimedia Conference, ACM on Multimedia Conference, Association for Computing Machinery, Mountain View, California, USA, pp. 501-509.View/Download from: Publisher's site
© 2017 ACM. Retargeting aims at adapting an original high-resolution photo/video to a low-resolution screen with an arbitrary aspect ratio. Conventional approaches are generally based on desktop PCs, since the computation may be intolerable for mobile platforms (especially when retargeting videos). Besides, typically only low-level visual features are exploited, whereas human visual perception is not well encoded. In this paper, we propose a novel retargeting framework which rapidly shrinks a photo/video by leveraging human gaze behavior. Specifically, we first derive a geometry-preserved graph ranking algorithm, which efficiently selects a few salient object patches to mimic the human gaze shifting path (GSP) when viewing each scene. Afterward, an aggregation-based CNN is developed to hierarchically learn the deep representation for each GSP. Based on this, a probabilistic model is developed to learn the priors of the training photos which are marked as aesthetically pleasing by professional photographers. We utilize the learned priors to efficiently shrink the corresponding GSP of a retargeted photo/video to be maximally similar to those from the training photos. Extensive experiments have demonstrated that: 1) our method consumes less than 35ms to retarget a 1024 × 768 photo (or a 1280 × 720 video frame) on popular iOS/Android devices, which is orders of magnitude faster than conventional retargeting algorithms; 2) the retargeted photos/videos produced by our method significantly outperform those of its competitors in a paired-comparison-based user study; and 3) the learned GSPs are highly indicative of human visual attention according to human eye-tracking experiments.
Luo, M, Nie, F, Chang, X, Yang, Y, Hauptmann, A & Zheng, Q 2017, 'Probabilistic non-negative matrix factorization and its robust extensions for topic modeling', 31st AAAI Conference on Artificial Intelligence, AAAI 2017, Conference on Artificial Intelligence, San Francisco, California USA, pp. 2308-2314.
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Traditional topic models with maximum likelihood estimation inevitably suffer from the assumption of conditional independence of words given the document's topic distribution. In this paper, we follow the generative procedure of the topic model and learn the topic-word distribution and topic distribution by directly approximating the word-document co-occurrence matrix with matrix decomposition techniques. These methods include: (1) approximating the normalized document-word conditional distribution with the document probability matrix and word probability matrix based on probabilistic non-negative matrix factorization (NMF); (2) since the standard NMF is well known to be non-robust to noise and outliers, extending the probabilistic NMF of the topic model to its robust versions using ℓ2,1-norm and capped ℓ2,1-norm based loss functions, respectively. The proposed framework inherits the explicit probabilistic meaning of factors in topic models and simultaneously makes the conditional independence assumption on words unnecessary. Straightforward and efficient algorithms are exploited to solve the corresponding non-smooth and non-convex problems. Experimental results over several benchmark datasets illustrate the effectiveness and superiority of the proposed methods.
Pan, P, Feng, J, Chen, L & Yang, Y 2017, 'Online compressed robust PCA', Proceedings of the International Joint Conference on Neural Networks, International Joint Conference on Neural Networks, IEEE, Anchorage, AK, USA, pp. 1041-1048.View/Download from: Publisher's site
© 2017 IEEE. In this work, we consider the problem of robust principal component analysis (RPCA) for streaming noisy data that has been highly compressed. This problem is prominent when one deals with high-dimensional and large-scale data and data compression is necessary. To solve this problem, we propose an online compressed RPCA algorithm to efficiently recover the low-rank components of raw data. Though data compression incurs severe information loss, we provide deep analysis on the proposed algorithm and prove that the low-rank component can be asymptotically recovered under mild conditions. Compared with other recent works on compressed RPCA, our algorithm reduces the memory cost significantly by processing data in an online fashion and reduces the communication cost by accepting sequential compressed data as input.
Xu, Z, Zhu, L & Yang, Y 2017, 'Few-Shot Object Recognition from Machine-Labeled Web Images', IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 5358-5366.View/Download from: Publisher's site
Yan, Y, Yang, T, Yang, Y & Chen, J 2017, 'A Framework of Online Learning with Imbalanced Streaming Data', Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, San Francisco, USA, pp. 2817-2823.
A challenge for mining large-scale streaming data overlooked by most existing studies on online learning is the skew distribution of examples over different classes. Many previous works have considered cost-sensitive approaches in an online setting for streaming data, where fixed costs are assigned to different classes, or ad-hoc costs are adapted based on the distribution of data received so far. However, they do not necessarily achieve optimal performance in terms of the measures suited for imbalanced data, such as F-measure, area under the ROC curve (AUROC), and area under the precision-recall curve (AUPRC). This work proposes a general framework for online learning with imbalanced streaming data, where examples arrive sequentially and models are updated accordingly on-the-fly. By simultaneously learning multiple classifiers with different cost vectors, the proposed method can be adopted for different target measures for imbalanced data, including F-measure, AUROC and AUPRC. Moreover, we present a rigorous theoretical justification of the proposed framework for F-measure maximization. Our empirical studies demonstrate the competitive, if not better, performance of the proposed method compared to previous cost-sensitive and resampling-based online learning algorithms and those that are designed for optimizing certain measures.
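A toy sketch of the core idea, simultaneous cost-sensitive learners with different cost vectors; the update rule below is a simple cost-weighted perceptron, an illustrative stand-in rather than the paper's algorithm:

```python
import numpy as np

def cost_sensitive_update(w, x, y, cost_pos, cost_neg, lr=0.1):
    # One online cost-weighted perceptron-style update (a sketch).
    # y is +1/-1; the misclassification cost differs per class.
    if y * (w @ x) <= 0:
        cost = cost_pos if y > 0 else cost_neg
        w = w + lr * cost * y * x
    return w

def online_imbalanced_learning(stream, dim, cost_vectors):
    # Learn one classifier per candidate cost vector simultaneously;
    # afterwards, the classifier scoring best under the target measure
    # (F-measure, AUROC, ...) on held-out data would be selected.
    models = [np.zeros(dim) for _ in cost_vectors]
    for x, y in stream:
        for i, (cp, cn) in enumerate(cost_vectors):
            models[i] = cost_sensitive_update(models[i], x, y, cp, cn)
    return models

# Toy usage: a 95%-negative stream, three candidate cost vectors.
rng = np.random.default_rng(4)
stream = [(rng.normal(size=3), 1 if rng.random() < 0.05 else -1)
          for _ in range(1000)]
models = online_imbalanced_learning(stream, 3, [(1, 1), (5, 1), (19, 1)])
print([np.round(m, 2) for m in models])
```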
Yang, Y 2017, 'A Dual-Network Progressive Approach to Weakly Supervised Object Detection', MM '17 Proceedings of the 2017 ACM on Multimedia Conference, ACM on Multimedia Conference, ACM, Mountain View, pp. 279-287.View/Download from: Publisher's site
A major challenge that arises in Weakly Supervised Object Detection (WSOD) is that only image-level labels are available, whereas WSOD trains instance-level object detectors. A typical approach to WSOD is to 1) generate a series of region proposals for each image and assign the image-level label to all the proposals in that image; 2) train a classifier using all the proposals; and 3) use the classifier to select proposals with high confidence scores as the positive instances for another round of training. In this way, the image-level labels are iteratively transferred to instance-level labels.
We aim to resolve the following two fundamental problems within this paradigm. First, existing proposal generation algorithms are not yet robust, thus the object proposals are often inaccurate. Second, the selected positive instances are sometimes noisy and unreliable, which hinders the training at subsequent iterations. We adopt two separate neural networks, one to focus on each problem, to better utilize the specific characteristic of region proposal refinement and positive instance selection. Further, to leverage the mutual benefits of the two tasks, the two neural networks are jointly trained and reinforced iteratively in a progressive manner, starting with easy and reliable instances and then gradually incorporating difficult ones at a later stage when the selection classifier is more robust. Extensive experiments on the PASCAL VOC dataset show that our method achieves state-of-the-art performance.
Zheng, L, Zhang, H, Sun, S, Chandraker, M, Yang, Y & Tian, Q 2017, 'Person Re-identification in the Wild', Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, HI, USA.View/Download from: Publisher's site
This paper presents a novel large-scale dataset and comprehensive baselines for end-to-end pedestrian detection and person recognition in raw video frames. Our baselines address three issues: the performance of various combinations of detectors and recognizers, mechanisms for pedestrian detection to help improve overall re-identification (re-ID) accuracy, and assessing the effectiveness of different detectors for re-ID. We make three distinct contributions. First, a new dataset, PRW, is introduced to evaluate Person Re-identification in the Wild, using videos acquired through six synchronized cameras. It contains 932 identities and 11,816 frames in which pedestrians are annotated with their bounding box positions and identities. Extensive benchmarking results are presented on this dataset. Second, we show that pedestrian detection aids re-ID through two simple yet effective improvements: a cascaded fine-tuning strategy that trains a detection model first and then the classification model, and a Confidence Weighted Similarity (CWS) metric that incorporates detection scores into similarity measurement. Third, we derive insights in evaluating detector performance for the particular scenario of accurate person re-ID.
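A minimal NumPy sketch of the CWS idea, assuming L2-normalized features; all names are illustrative:

```python
import numpy as np

def confidence_weighted_similarity(query_feat, gallery_feats, det_scores):
    # Sketch of Confidence Weighted Similarity (CWS): scale each gallery
    # match's similarity by its pedestrian-detection score, so
    # low-confidence (likely false) detections are ranked down.
    sims = gallery_feats @ query_feat  # cosine similarity for unit vectors
    return sims * det_scores

# Toy usage: the second gallery box matches well but was a weak detection.
q = np.array([1.0, 0.0])
g = np.array([[0.9, 0.1], [0.95, 0.05], [0.1, 0.9]])
g = g / np.linalg.norm(g, axis=1, keepdims=True)
print(confidence_weighted_similarity(q, g, np.array([0.9, 0.3, 0.8])))
```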
Zheng, Z, Zheng, L & Yang, Y 2017, 'Unlabeled Samples Generated by GAN Improve the Person Re-identification Baseline in vitro', Proceedings 2017 IEEE International Conference on Computer Vision CCV 2017, 2017 IEEE International Conference on Computer Vision, IEEE, Venice, Italy.View/Download from: Publisher's site
The main contribution of this paper is a simple semi-supervised pipeline that only uses the original training set without collecting extra data. It is challenging in 1) how to obtain more training data only from the training set and 2) how to use the newly generated data. In this work, the generative adversarial network (GAN) is used to generate unlabeled samples. We propose the label smoothing regularization for outliers (LSRO). This method assigns a uniform label distribution to the unlabeled images, which regularizes the supervised model and improves the baseline. We verify the proposed method on a practical problem: person re-identification (re-ID). This task aims to retrieve a query person from other cameras. We adopt the deep convolutional generative adversarial network (DCGAN) for sample generation, and a baseline convolutional neural network (CNN) for representation learning. Experiments show that adding the GAN-generated data effectively improves the discriminative ability of learned CNN embeddings. On three large-scale datasets, Market-1501, CUHK03 and DukeMTMC-reID, we obtain +4.37%, +1.6% and +2.46% improvement in rank-1 precision over the baseline CNN, respectively. We additionally apply the proposed method to fine-grained bird recognition and achieve a +0.6% improvement over a strong baseline.
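A minimal NumPy sketch of the LSRO loss described above (single sample, unnormalized logits; names are illustrative):

```python
import numpy as np

def lsro_loss(logits, label, num_classes):
    # LSRO sketch: real images use the usual one-hot cross-entropy;
    # GAN-generated (unlabeled) images get a uniform label distribution,
    # i.e. cross-entropy against 1/K for every class.
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    if label is None:  # GAN-generated sample
        target = np.full(num_classes, 1.0 / num_classes)
    else:
        target = np.zeros(num_classes)
        target[label] = 1.0
    return -np.sum(target * log_probs)

# Toy usage: a confident prediction is cheap for the matching real label
# but expensive for a GAN sample, which regularizes over-confidence.
logits = np.array([4.0, 0.0, 0.0, 0.0])
print(lsro_loss(logits, 0, 4), lsro_loss(logits, None, 4))
```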
Zhu, L, Xu, Z & Yang, Y 2017, 'Bidirectional Multirate Reconstruction for Temporal Modeling in Videos', IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1339-1348.View/Download from: Publisher's site
Chang, X, Yu, YL, Yang, Y & Xing, EP 2016, 'They are not equally reliable: Semantic event search using differentiated concept classifiers', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, Nevada, United States, pp. 1884-1893.View/Download from: Publisher's site
Complex event detection on unconstrained Internet videos has seen much progress in recent years. However, state-of-the-art performance degrades dramatically when the number of positive training exemplars falls short. Since label acquisition is costly, laborious, and time-consuming, there is a real need to consider the much more challenging semantic event search problem, where no example video is given. In this paper, we present a state-of-the-art event search system without any example videos. Relying on the key observation that events (e.g. dog show) are usually compositions of multiple mid-level concepts (e.g. "dog," "theater," and "dog jumping"), we first train a skip-gram model to measure the relevance of each concept with the event of interest. The relevant concept classifiers then cast votes on the test videos but their reliability, due to lack of labeled training videos, has been largely unaddressed. We propose to combine the concept classifiers based on a principled estimate of their accuracy on the unlabeled test videos. A novel warping technique is proposed to improve the performance and an efficient highly-scalable algorithm is provided to quickly solve the resulting optimization. We conduct extensive experiments on the latest TRECVID MEDTest 2014, MEDTest 2013 and CCV datasets, and achieve state-of-the-art performances.
Du, X, Yin, H, Huang, Z, Yang, Y & Zhou, X 2016, 'Using detected visual objects to index video database', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Australasian Database Conference, Springer, Sydney, New South Wales, Australia, pp. 333-345.View/Download from: Publisher's site
© Springer International Publishing AG 2016. In this paper, we focus on how to use visual objects to index videos. Two tables are constructed for this purpose, namely the unique object table and the occurrence table. The former stores the unique objects which appear in the videos, while the latter stores the occurrence information of these unique objects in the videos. In previous works, these two tables are generated manually by a top-down process. That is, the unique object table is first given by experts, then the occurrence table is generated by annotators according to the unique object table. Obviously, such a process, which heavily depends on human labor, limits scalability, especially when the data are dynamic or large-scale. To improve this, we propose to perform a bottom-up process to generate these two tables. The novelties are: we use an object detector instead of human annotation to create the occurrence table; we propose a hybrid method which consists of local merge, global merge and propagation to generate the unique object table and fix the occurrence table. In fact, there are three other candidate methods for implementing the bottom-up process, namely recognizing-based, matching-based and tracking-based methods. Through analyzing their mechanisms and evaluating their accuracy, we find that they are not suitable for the bottom-up process. The proposed hybrid method leverages the advantages of the matching-based and tracking-based methods. Our experiments show that the hybrid method is more accurate and efficient than the candidate methods, which indicates that it is more suitable for the proposed bottom-up process.
Luo, M, Nie, F, Chang, X, Yang, Y, Hauptmann, A & Zheng, Q 2016, 'Avoiding optimal mean robust PCA/2DPCA with non-greedy ℓ1-norm maximization', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press / International Joint Conferences on Artificial Intelligence, New York City, New York, United States, pp. 1802-1808.
Robust principal component analysis (PCA) is one of the most important dimension reduction techniques to handle high-dimensional data with outliers. However, the existing robust PCA presupposes that the mean of the data is zero and incorrectly utilizes the Euclidean distance based optimal mean for robust PCA with ℓ1-norm. Some studies consider this issue and integrate the estimation of the optimal mean into the dimension reduction objective, which leads to expensive computation. In this paper, we equivalently reformulate the maximization of variances for robust PCA, such that the optimal projection directions are learned by maximizing the sum of the projected difference between each pair of instances, rather than the difference between each instance and the mean of the data. Based on this reformulation, we propose a novel robust PCA to automatically avoid the calculation of the optimal mean based on ℓ1-norm distance. This strategy also makes the assumption of centered data unnecessary. Additionally, we intuitively extend the proposed robust PCA to its 2D version for image recognition. Efficient non-greedy algorithms are exploited to solve the proposed robust PCA and 2D robust PCA with fast convergence and low computational complexity. Some experimental results on benchmark data sets demonstrate the effectiveness and superiority of the proposed approaches on image reconstruction and recognition.
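For the ℓ2 case, the identity that makes the mean dispensable is classical; the abstract's robust objective then places the ℓ1-norm on the pairwise projected differences. A LaTeX sketch of our reading of the reformulation (notation is ours, not the paper's):

```latex
% Total pairwise scatter equals 2n times the scatter around the mean,
% so maximizing projected variance never needs the mean explicitly:
\sum_{i=1}^{n}\sum_{j=1}^{n} \bigl\| W^{\top}x_i - W^{\top}x_j \bigr\|_2^2
  \;=\; 2n \sum_{i=1}^{n} \bigl\| W^{\top}(x_i - \bar{x}) \bigr\|_2^2 .
% The robust variant (our reading) replaces the squared l2-norm with
% the l1-norm on the pairwise differences, dropping the mean entirely:
\max_{W^{\top}W = I}\; \sum_{i<j} \bigl\| W^{\top}(x_i - x_j) \bigr\|_1 .
```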
Pan, P, Xu, Z, Yang, Y, Wu, F & Zhuang, Y 2016, 'Hierarchical recurrent neural encoder for video representation with application to captioning', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, WA, USA, pp. 1029-1038.View/Download from: Publisher's site
Recently, deep learning approaches, especially deep Convolutional Neural Networks (ConvNets), have achieved overwhelming accuracy with fast processing speed for image classification. Incorporating temporal structure with deep ConvNets for video representation thus becomes a fundamental problem for video content analysis. In this paper, we propose a new approach, namely Hierarchical Recurrent Neural Encoder (HRNE), to exploit temporal information of videos. Compared to recent video representation inference approaches, this paper makes the following three contributions. First, our HRNE is able to efficiently exploit video temporal structure in a longer range by reducing the length of the input information flow and compositing multiple consecutive inputs at a higher level. Second, computation operations are significantly lessened while more non-linearity is attained. Third, HRNE is able to uncover temporal transitions between frame chunks with different granularities, i.e., it can model the temporal transitions between frames as well as the transitions between segments. We apply the new method to video captioning, where temporal information plays a crucial role. Experiments demonstrate that our method outperforms the state-of-the-art on video captioning benchmarks.
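A minimal PyTorch sketch of the two-level encoding idea; layer sizes, chunk length, and the ragged-tail handling are illustrative assumptions:

```python
import torch
import torch.nn as nn

class HRNESketch(nn.Module):
    # Sketch of a Hierarchical Recurrent Neural Encoder: a lower GRU
    # encodes short chunks of frame features, and an upper GRU encodes
    # the sequence of chunk summaries, shortening the information flow.
    def __init__(self, feat_dim=2048, hidden=512, chunk=8):
        super().__init__()
        self.chunk = chunk
        self.low = nn.GRU(feat_dim, hidden, batch_first=True)
        self.high = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, frames):                    # frames: (B, T, feat_dim)
        B, T, D = frames.shape
        frames = frames[:, : T - T % self.chunk]  # drop a ragged tail for simplicity
        chunks = frames.reshape(-1, self.chunk, D)
        _, h_low = self.low(chunks)               # h_low: (1, B*n_chunks, hidden)
        summaries = h_low.squeeze(0).reshape(B, -1, h_low.shape[-1])
        _, h_high = self.high(summaries)          # one vector per video
        return h_high.squeeze(0)                  # (B, hidden)

# Toy usage: 40 frames of 2048-d features -> a single 512-d video code.
video = torch.randn(2, 40, 2048)
print(HRNESketch()(video).shape)  # torch.Size([2, 512])
```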
Yang, Y, Gan, C, Lin, M, de Molo, G & Hauptmann, AG 2016, 'Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition', Website Proceedings of 2016 AAAI Conference on Artificial Intelligence, AAAI, AAAI Conference on Artificial Intelligence, AAAI, Phoenix, Arizona, pp. 3487-3493.
Vast quantities of videos are now being captured at astonishing rates, but the majority of these are not labelled. To cope with such data, we consider the task of content-based activity recognition in videos without any manually labelled examples, also known as zero-shot video recognition. To achieve this, videos are represented in terms of detected visual concepts, which are then scored as relevant or irrelevant according to their similarity with a given textual query. In this paper, we propose a more robust approach for scoring concepts in order to alleviate many of the brittleness and low-precision problems of previous work. We jointly consider semantic relatedness, visual reliability, and discriminative power. To handle noise and non-linearities in the ranking scores of the selected concepts, we propose a novel pairwise order matrix approach for score aggregation. Extensive experiments on the large-scale TRECVID Multimedia Event Detection data show the superiority of our approach.
Yang, Y, Gan, C, Yao, T, Yang, K & Mei, T 2016, 'You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images', Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, pp. 923-932.View/Download from: Publisher's site
Copyright © TRECVID 2016. All rights reserved. In this report, we summarize our solution to the TRECVID 2016 Video Localization task. We mainly use Faster R-CNN to localize objects in the spatial domain, combined with frame-level and shot-level detectors to localize concepts in the temporal domain. We collected images with annotated bounding boxes from external sources, e.g., the ImageNet Detection dataset, and manually annotated bounding boxes for categories without any annotations. We trained frame-level detectors using ResNet-200 features pre-trained on ImageNet, and for the classes "Running", "Sitting Down" and "Dancing" we also use improved Dense Trajectories features. Finally, we fuse the bounding box score, frame score and shot score to get the final score for each bounding box.
Chang, X, Yang, Y, Long, G, Zhang, C & Hauptmann, AG 2016, 'Dynamic concept composition for zero-example event detection', Proceedings of 30th AAAI Conference on Artificial Intelligence, AAAI 2016, AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence, Phoenix, Arizona, United States, pp. 3464-3470.
© Copyright 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. In this paper, we focus on automatically detecting events in unconstrained videos without the use of any visual training exemplars. In principle, zero-shot learning makes it possible to train an event detection model based on the assumption that events (e.g. birthday party) can be described by multiple mid-level semantic concepts (e.g. "blowing candle", "birthday cake"). Towards this goal, we first pre-train a bundle of concept classifiers using data from other sources. Then we evaluate the semantic correlation of each concept w.r.t. the event of interest and pick the relevant concept classifiers, which are applied to all test videos to get multiple prediction score vectors. While most existing systems combine the predictions of the concept classifiers with fixed weights, we propose to learn the optimal weights of the concept classifiers for each testing video by exploring a set of online available videos with free-form text descriptions of their content. To validate the effectiveness of the proposed approach, we have conducted extensive experiments on the latest TRECVID MEDTest 2014, MEDTest 2013 and CCV datasets. The experimental results confirm the superiority of the proposed approach.
Yan, Y, Xu, Z, Tsang, W, Long, G & Yang, Y 2016, 'Robust Semi-supervised Learning through Label Aggregation', Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), AAAI Conference on Artificial Intelligence, AAAI, Phoenix, USA, pp. 2244-2250.
Semi-supervised learning is proposed to exploit both labeled and unlabeled data. However, as the scale of data in real-world applications increases significantly, conventional semi-supervised algorithms usually lead to massive computational cost and cannot be applied to large-scale datasets. In addition, label noise is usually present in practical applications due to human annotation, which very likely results in remarkable degeneration of performance in semi-supervised methods. To address these two challenges, in this paper we propose an efficient RObust Semi-Supervised Ensemble Learning (ROSSEL) method, which generates pseudo-labels for unlabeled data using a set of weak annotators, and combines them to approximate the ground-truth labels to assist semi-supervised learning. We formulate the weighted combination process as a multiple label kernel learning (MLKL) problem which can be solved efficiently. Compared with other semi-supervised learning algorithms, the proposed method has linear time complexity. Extensive experiments on five benchmark datasets demonstrate the superior effectiveness, efficiency and robustness of the proposed algorithm.
Chang, X, Yang, Y, Hauptmann, A, Xing, EP & Yu, YL 2015, 'Semantic Concept Discovery for Large-Scale Zero-Shot Event Detection', Proceedings of the 24th International Conference on Artificial Intelligence, International Joint Conference of Artificial Intelligence, ACM, Buenos Aires, Argentina, pp. 2234-2240.
We focus on detecting complex events in unconstrained Internet videos. While most existing works rely on the abundance of labeled training data, we consider a more difficult zero-shot setting where no training data is supplied. We first pre-train a number of concept classifiers using data from other sources. Then we evaluate the semantic correlation of each concept w.r.t. the event of interest. After further refinement to take prediction inaccuracy and discriminative power into account, we apply the discovered concept classifiers on all test videos and obtain multiple score vectors. These distinct score vectors are converted into pairwise comparison matrices and the nuclear norm rank aggregation framework is adopted to seek consensus. To address the challenging optimization formulation, we propose an efficient, highly scalable algorithm that is an order of magnitude faster than existing alternatives. Experiments on recent TRECVID datasets verify the superiority of the proposed approach.
Chang, X, Yang, Y, Xing, EP & Yu, Y-L 2015, 'Complex Event Detection using Semantic Saliency and Nearly-Isotonic SVM', Proceedings of The 32nd International Conference on Machine Learning, International Conference on Machine Learning, International Machine Learning Society, Lille, France, pp. 1348-1357.
We aim to detect complex events in long Internet videos that may last for hours. A major challenge in this setting is that only a few shots in a long video are relevant to the event of interest while others are irrelevant or even misleading. Instead of indifferently pooling the shots, we first define a novel notion of semantic saliency that assesses the relevance of each shot with the event of interest. We then prioritize the shots according to their saliency scores, since shots that are semantically more salient are expected to contribute more to the final event detector. Next, we propose a new isotonic regularizer that is able to exploit the semantic ordering information. The resulting nearly-isotonic SVM classifier exhibits higher discriminative power. Computationally, we develop an efficient implementation using the proximal gradient algorithm, and we prove new, closed-form proximal steps. We conduct extensive experiments on three real-world video datasets and confirm the effectiveness of the proposed method.
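A minimal NumPy sketch of a nearly-isotonic penalty consistent with the abstract, assuming shots are pre-sorted by decreasing saliency; the paper's exact regularizer and its closed-form proximal steps are not reproduced:

```python
import numpy as np

def nearly_isotonic_penalty(scores, lam):
    # Sketch of a nearly-isotonic regularizer: shots are ordered by
    # semantic saliency, and the penalty charges only violations of
    # that order (a later, less-salient shot scoring above an earlier
    # one), rather than enforcing strict monotonicity.
    violations = np.maximum(0.0, scores[1:] - scores[:-1])
    return lam * violations.sum()

# Toy usage: decreasing scores incur no penalty; one inversion does.
print(nearly_isotonic_penalty(np.array([0.9, 0.7, 0.4]), lam=1.0))  # 0.0
print(nearly_isotonic_penalty(np.array([0.9, 0.4, 0.7]), lam=1.0))  # ~0.3
```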
Chang, X, Yu, YL, Yang, Y & Hauptmann, AG 2015, 'Searching persuasively: Joint event detection and evidence recounting with limited supervision', Proceedings of the 23rd ACM international conference on Multimedia, ACM International Conference on Multimedia, ACM, Brisbane, Australia, pp. 581-590.View/Download from: Publisher's site
© 2015 ACM. Multimedia event detection (MED) and multimedia event recounting (MER) are fundamental tasks in managing large amounts of unconstrained web videos, and have attracted a lot of attention in recent years. Most existing systems perform MER as a postprocessing step on top of the MED results. In order to leverage the mutual benefits of the two tasks, we propose a joint framework that simultaneously detects high-level events and localizes the indicative concepts of the events. Our premise is that a good recounting algorithm should not only explain the detection result, but should also be able to assist detection in the first place. Coupled in a joint optimization framework, recounting improves detection by pruning irrelevant noisy concepts while detection directs recounting to the most discriminative evidences. To better utilize the powerful and interpretable semantic video representation, we segment each video into several shots and exploit the rich temporal structures at shot level. The consequent computational challenge is carefully addressed through a significant improvement of the current ADMM algorithm, which, after eliminating all inner loops and equipping novel closed-form solutions for all intermediate steps, enables us to efficiently process extremely large video corpora. We test the proposed method on the large scale TRECVID MEDTest 2014 and MEDTest 2013 datasets, and obtain very promising results for both MED and MER.
Chang, XJ, Nie, FP, Ma, ZG, Yang, Y & Zhou, XF 2015, 'A Convex Formulation for Spectral Shrunk Clustering', Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, Austin Texas, USA, pp. 2532-2538.
Spectral clustering is a fundamental technique in the field of data mining and information processing. Most existing spectral clustering algorithms integrate dimensionality reduction into the clustering process assisted by manifold learning in the original space. However, the manifold in the reduced-dimensional subspace is likely to exhibit altered properties in contrast with the original space. Thus, applying manifold information obtained from the original space to the clustering process in a low-dimensional subspace is prone to inferior performance. Aiming to address this issue, we propose a novel convex algorithm that mines the manifold structure in the low-dimensional subspace. In addition, our unified learning process makes the manifold learning particularly tailored for the clustering. Compared with other related methods, the proposed algorithm produces a more structured clustering result. To validate the efficacy of the proposed algorithm, we perform extensive experiments on several benchmark datasets in comparison with some state-of-the-art clustering approaches. The experimental results demonstrate that the proposed algorithm has quite promising clustering performance.
Gan, C, Lin, M, Yang, Y, Zhuang, YT & Hauptmann, AG 2015, 'Exploring Semantic Inter-Class Relationships (SIR) for Zero-Shot Action Recognition', Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, Austin, USA, pp. 3769-3775.
Automatically recognizing a large number of action categories from videos is of significant importance for video understanding. Most existing works have focused on the design of more discriminative feature representations, and have achieved promising results when positive samples are plentiful. However, very limited effort has been spent on recognizing a novel action without any positive exemplars, which is often the case in real settings due to the large number of action classes and the dramatic variations of users' queries. To address this issue, we propose to perform action recognition when no positive exemplars of that class are provided, which is often known as zero-shot learning. Different from other zero-shot learning approaches, which exploit attributes as the intermediate layer for knowledge transfer, our main contribution is SIR, which directly leverages the semantic inter-class relationships between the known and unknown actions, followed by label transfer learning. The inter-class semantic relationships are automatically measured by continuous word vectors, which are learned by the skip-gram model from a large-scale text corpus. Extensive experiments on the UCF101 dataset validate the superiority of our method over fully-supervised approaches using few positive exemplars.
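The label-transfer step lends itself to a compact illustration. The sketch below is not the paper's exact formulation: random vectors stand in for skip-gram word embeddings and for pre-trained known-class classifier scores, and the top-k similarity weighting is an illustrative assumption. It scores an unseen action as a similarity-weighted combination of known-class scores.

```python
import numpy as np

def zero_shot_scores(known_scores, known_vecs, unseen_vec, top_k=5):
    # Cosine similarity between the unseen class name and each known class name.
    sims = known_vecs @ unseen_vec
    sims = sims / (np.linalg.norm(known_vecs, axis=1)
                   * np.linalg.norm(unseen_vec) + 1e-12)
    # Transfer labels from the few most related known classes (illustrative rule).
    idx = np.argsort(sims)[-top_k:]
    w = np.clip(sims[idx], 0, None)
    w = w / (w.sum() + 1e-12)
    return known_scores[:, idx] @ w   # one zero-shot score per test video

rng = np.random.default_rng(0)
known_scores = rng.random((10, 50))      # 10 test videos x 50 known-class scores
known_vecs = rng.normal(size=(50, 300))  # toy stand-ins for skip-gram vectors
unseen_vec = rng.normal(size=300)        # word vector of the novel action name
print(zero_shot_scores(known_scores, known_vecs, unseen_vec))
```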
Gan, C, Wang, N, Yang, Y, Yeung, DY & Hauptmann, AG 2015, 'Devnet: A Deep Event Network for Multimedia Event Detection and Evidence Recounting', Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 2568-2577.View/Download from: Publisher's site
In this paper, we focus on complex event detection in internet videos while also providing the key evidence for the detection results. Convolutional Neural Networks (CNNs) have achieved promising performance in image classification and action recognition tasks. However, it remains an open problem how to use CNNs for video event detection and recounting, mainly due to the complexity and diversity of video events. In this work, we propose a flexible deep CNN infrastructure, namely Deep Event Network (DevNet), that simultaneously detects pre-defined events and provides key spatial-temporal evidence. Taking key frames of videos as input, we first detect the event of interest at the video level by aggregating the CNN features of the key frames. The pieces of evidence, which recount the detection results, are also automatically localized, both temporally and spatially. The challenge is that we only have video-level labels, while the key evidence usually occurs at the frame level. Based on the intrinsic property of CNNs, we first generate a spatial-temporal saliency map by a backward pass through DevNet, which can then be used to find the key frames that are most indicative of the event, as well as to localize the specific spatial position, usually an object, in the highly indicative area of the frame. Experiments on the large-scale TRECVID 2014 MEDTest dataset demonstrate the promising performance of our method, both for event detection and evidence recounting.
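The backward-pass saliency idea can be sketched generically. The snippet below is a minimal stand-in, not DevNet itself: a tiny PyTorch CNN scores a stack of key frames, and the gradient of the aggregated event score with respect to the input pixels serves as a spatial-temporal saliency map; the architecture, frame count and image size are toy assumptions.

```python
import torch
import torch.nn as nn

# A tiny stand-in frame scorer; DevNet itself is a much larger CNN over key frames.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1),
)

frames = torch.rand(16, 3, 64, 64, requires_grad=True)  # 16 key frames of a video
event_score = model(frames).sum()   # video-level score by aggregating frame scores
event_score.backward()              # back pass: gradients w.r.t. the input pixels

# Saliency: gradient magnitude per pixel, max over color channels.
saliency = frames.grad.abs().amax(dim=1)            # (16, 64, 64) spatial maps
frame_importance = saliency.flatten(1).mean(dim=1)  # a crude temporal cue
print("most indicative key frame:", frame_importance.argmax().item())
```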
Jiang, L, Yu, SI, Meng, D, Yang, Y, Mitamura, T & Hauptmann, AG 2015, 'Fast and accurate content-based semantic search in 100M Internet Videos', MM 2015 - Proceedings of the 2015 ACM Multimedia Conference, ACM International Conference on Multimedia, ACM, Brisbane, Australia, pp. 49-58.View/Download from: Publisher's site
© 2015 ACM. Large-scale content-based semantic search in video is an interesting and fundamental problem in multimedia analysis and retrieval. Existing methods index a video by raw concept detection scores that are dense and inconsistent, and thus cannot scale to the "big data" that are readily available on the Internet. This paper proposes a scalable solution. The key is a novel step called concept adjustment that represents a video by a few salient and consistent concepts that can be efficiently indexed by a modified inverted index. The proposed adjustment model relies on a concise optimization framework with interpretations. The proposed index leverages the text-based inverted index for video retrieval. Experimental results validate the efficacy and efficiency of the proposed method. The results show that our method can scale up semantic search while maintaining state-of-the-art search performance. Specifically, the proposed method (with reranking) achieves the best result on the challenging TRECVID Multimedia Event Detection (MED) zero-example task. It takes only 0.2 seconds on a single CPU core to search a collection of 100 million Internet videos.
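A toy sketch can make the indexing idea concrete. The crude top-k/threshold rule below merely stands in for the paper's optimization-based adjustment model, and the concept scores are made up: each video keeps only a few salient concepts, which are then served from an inverted index, text-retrieval style.

```python
from collections import defaultdict
import heapq

def adjust(concept_scores, k=3, min_score=0.3):
    # Crude concept "adjustment": keep only a few salient, confident concepts.
    top = heapq.nlargest(k, concept_scores.items(), key=lambda kv: kv[1])
    return {c: s for c, s in top if s >= min_score}

videos = {
    "v1": {"dog": 0.9, "park": 0.7, "car": 0.1},
    "v2": {"car": 0.8, "road": 0.6, "dog": 0.05},
}

index = defaultdict(list)              # concept -> postings of (video, score)
for vid, scores in videos.items():
    for c, s in adjust(scores).items():
        index[c].append((vid, s))

def search(query_concepts):
    hits = defaultdict(float)
    for c in query_concepts:           # accumulate scores over matching postings
        for vid, s in index.get(c, []):
            hits[vid] += s
    return sorted(hits.items(), key=lambda kv: -kv[1])

print(search(["dog", "park"]))         # ranked (video, score) pairs
```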
Liu, G, Yan, Y, Ricci, E, Yang, Y, Han, Y, Winkler, S & Sebe, N 2015, 'Inferring Painting Style with Multi-task Dictionary Learning', IJCAI'15 Proceedings of the 24th International Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press, Buenos Aires, pp. 2162-2168.
Nie, LQ, Zhang, LM, Yang, Y, Wang, M, Hong, R & Chua, TS 2015, 'Beyond Doctors: Future Health Prediction from Multimedia and Multimodal Observations', Proceedings of the 2015 ACM Multimedia Conference, ACM International Conference on Multimedia, ACM, Brisbane, Australia, pp. 591-600.View/Download from: Publisher's site
Although chronic diseases cannot be cured, they can be effectively controlled as long as we understand their progressions based on the current observational health records, which is often in the form of multimedia data. A large and growing body of literature has investigated the disease progression problem. However, far too little attention to date has been paid to jointly consider the following three observations of the chronic disease progression: 1) the health statuses at different time points are chronologically similar; 2) the future health statuses of each patient can be comprehensively revealed from the current multimedia and multimodal observations, such as visual scans, digital measurements and textual medical histories; and 3) the discriminative capabilities of different modalities vary significantly in accordance to specific diseases. In the light of these, we propose an adaptive multimodal multi-task learning model to co-regularize the modality agreement, temporal progression and discriminative capabilities of different modalities. We theoretically show that our proposed model is a linear system. Before training our model, we address the data missing problem via the matrix factorization approach. Extensive evaluations on a real-world Alzheimer's disease dataset well verify our proposed model. It should be noted that our model is also applicable to other chronic diseases.
Wu, Song, Yang, Y, Li, Zhang & Zhuang 2015, 'Structured Embedding via Pairwise Relations and Long-Range Interactions in Knowledge Base', Proceedings of the National Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, Austin, Texas, USA.
We consider the problem of embedding entities and relations of knowledge bases into low-dimensional continuous vector spaces (distributed representations). Unlike most existing approaches, which are primarily effective for modelling pairwise relations between entities, we attempt to explicitly model both pairwise relations and long-range interactions between entities, by interpreting them as linear operators on the low-dimensional embeddings of the entities. Therefore, in this paper we introduce Path-Ranking to capture the long-range interactions of a knowledge graph while at the same time preserving its pairwise relations; we call the model 'structured embedding via pairwise relations and long-range interactions' (referred to as SePLi). Compared with state-of-the-art models, SePLi achieves better embedding performance.
Xu, ZW, Yang, Y & Hauptmann, AG 2015, 'A Discriminative CNN Video Representation for Event Detection', 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA.View/Download from: Publisher's site
In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkits. This paper makes two contributions to the inference of CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating frame level static features, we show that performance can be significantly improved by taking advantage of an appropriate encoding method. Second, we propose using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable. The integration of the two contributions results in a new state-of-the-art performance in event detection over the largest video datasets. Compared to improved Dense Trajectories, which has been recognized as the best video representation for event detection, our new representation improves the Mean Average Precision (mAP) from 27.6% to 36.8% for the TRECVID MEDTest 14 dataset and from 34.0% to 44.6% for the TRECVID MEDTest 13 dataset.
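The encoding point generalizes beyond the paper's specific pipeline. As a hedged illustration, the sketch below aggregates frame-level descriptors with VLAD instead of average or max pooling; the descriptor dimensionality, the number of centers, and the random inputs are toy assumptions, and in practice the centers would come from k-means on training descriptors.

```python
import numpy as np

def vlad_encode(frame_descs, centers):
    # VLAD: accumulate residuals to the nearest center, then power- and
    # L2-normalize; a richer aggregation than average or max pooling.
    d2 = ((frame_descs[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(1)                        # nearest center per frame
    K, D = centers.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        sel = frame_descs[assign == k]
        if len(sel):
            vlad[k] = (sel - centers[k]).sum(0)  # residual accumulation
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))  # power normalization
    return vlad / (np.linalg.norm(vlad) + 1e-12)

rng = np.random.default_rng(0)
video_repr = vlad_encode(rng.normal(size=(120, 64)),  # 120 frames x 64-d descriptors
                         rng.normal(size=(8, 64)))    # 8 assumed k-means centers
print(video_repr.shape)  # (512,) video-level representation
```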
Yan, Y, Yang, Y, Shen, H, Meng, D, Liu, GW, Hauptmann, AG & Sebe, N 2015, 'Complex Event Detection via Event Oriented Dictionary Learning', Proceedings of the 29th AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, Austin, USA, pp. 3841-3847.
Complex event detection is a retrieval task with the goal of finding videos of a particular event in a large-scale unconstrained internet video archive, given example videos and text descriptions. Nowadays, different multimodal fusion schemes of low-level and high-level features are extensively investigated and evaluated for the complex event detection task. However, how to effectively select high-level, semantically meaningful concepts from a large pool to assist complex event detection is rarely studied in the literature. In this paper, we propose two novel strategies to automatically select semantically meaningful concepts for the event detection task, based on both the event-kit text descriptions and the concepts' high-level feature descriptions. Moreover, we introduce a novel event-oriented dictionary representation based on the selected semantic concepts. Towards this goal, we leverage training samples of the selected concepts from the Semantic Indexing (SIN) dataset, with a pool of 346 concepts, in a novel supervised multi-task dictionary learning framework. Extensive experimental results on the TRECVID Multimedia Event Detection (MED) dataset demonstrate the efficacy of our proposed method.
Yang, Y, Yu, SI, Jiang, L, Xu, ZW & Hauptmann, AG 2015, 'Content-Based Video Search over 1 Million Videos with 1 Core in 1 Second', Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, International Conference on Multimedia Retrieval, ACM, Shanghai, China, pp. 419-426.View/Download from: Publisher's site
Many content-based video search (CBVS) systems have been proposed to analyze the rapidly-increasing amount of user-generated videos on the Internet. Though the accuracy of CBVS systems has drastically improved, these high-accuracy systems tend to be too inefficient for interactive search. Therefore, to strive for real-time web-scale CBVS, we perform a comprehensive study of the different components in a CBVS system to understand the trade-offs between the accuracy and speed of each component. Directions investigated include exploring different low-level and semantics-based features, testing different compression factors and approximations during video search, and understanding the time vs. accuracy trade-off of reranking. Extensive experiments on data sets consisting of more than 1,000 hours of video showed that, through a combination of effective features, highly compressed representations, and one iteration of reranking, our proposed system can achieve a 10,000-fold speedup while retaining 80% of the accuracy of a state-of-the-art CBVS system. We further performed search over 1 million videos and demonstrated that our system can complete the search in 0.975 seconds with a single core, which potentially opens the door to interactive web-scale CBVS for the general public.
Zhang, L, Yang, Y & Zimmermann, R 2015, 'Fine-grained image categorization by localizing tiny object parts from unannotated images', Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, Annual ACM International Conference on Multimedia Retrieval (ICMR), ACM, Shanghai, China, pp. 107-114.View/Download from: Publisher's site
This paper proposes a novel fine-grained image categorization model where no object annotation is required in the training/testing stage. The key technique is a dense graph mining algorithm that localizes multi-scale discriminative object parts in each image. In particular, to mimic the human hierarchical perception mechanism, a super-pixel pyramid is generated for each image, based on which graphlets from each layer are constructed to seamlessly describe object parts. We observe that graphlets representative of each category are densely distributed in the feature space. Therefore, a dense graph mining algorithm is developed to discover graphlets representative of each sub- and super-category. Finally, the discovered graphlets from pairwise images are encoded into an image kernel for fine-grained recognition. Experiments on the CUB-200 birds dataset show that our method performs competitively with many models relying on annotated bird parts.
Yan, Y, Tan, M, Yang, Y, Tsang, I, Zhang, C & Shi, Q 2015, 'Scalable maximum margin matrix factorization by active riemannian subspace search', Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), International Joint Conference on Artificial Intelligence, AAAI, Buenos Aires, Argentina, pp. 3988-3994.
The user ratings in recommendation systems are usually in the form of ordinal discrete values. To give more accurate predictions of such rating data, maximum margin matrix factorization (M3F) was proposed. Existing M3F algorithms, however, either have massive computational cost or require expensive model selection procedures to determine the number of latent factors (i.e., the rank of the matrix to be recovered), making them less practical for large-scale data sets. To address these two challenges, in this paper, we formulate M3F with a known number of latent factors as a Riemannian optimization problem on a fixed-rank matrix manifold and present a block-wise nonlinear Riemannian conjugate gradient method to solve it efficiently. We then apply a simple and efficient active subspace search scheme to automatically detect the number of latent factors. Empirical studies on both synthetic data sets and large real-world data sets demonstrate the superior efficiency and effectiveness of the proposed method.
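As a rough illustration of the fixed-rank idea, and only that, the sketch below factorizes a partially observed rating matrix at a fixed rank using plain alternating gradient steps with a squared loss. The paper instead optimizes a maximum-margin (hinge) objective with a block-wise Riemannian conjugate gradient on the fixed-rank manifold and detects the rank automatically via active subspace search; rank, learning rate and iteration count here are toy assumptions.

```python
import numpy as np

def fixed_rank_mf(R, mask, rank=2, lr=0.01, iters=500):
    # Fit R ~ U V^T at a fixed rank on the observed entries only
    # (squared loss; a simplification of the max-margin objective).
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(m, rank))
    V = rng.normal(scale=0.1, size=(n, rank))
    for _ in range(iters):
        E = mask * (U @ V.T - R)   # error on observed ratings only
        U -= lr * E @ V
        V -= lr * E.T @ U
    return U @ V.T

R = np.array([[5, 0, 3], [4, 0, 0], [0, 2, 1]], float)  # 0 = unobserved
mask = (R > 0).astype(float)
print(np.round(fixed_rank_mf(R, mask), 2))  # completed rating matrix
```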
Xu, Z, Tsang, W, Yang, Y, Ma, Z & Hauptmann, AG 2014, 'Event Detection using Multi-Level Relevance Labels and Multiple Features', 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, pp. 97-104.View/Download from: Publisher's site
We address the challenging problem of utilizing related exemplars for complex event detection while multiple features are available. Related exemplars share certain positive elements of the event, but have no uniform pattern due to the huge variance of relevance levels among different related exemplars. None of the existing multiple feature fusion methods can deal with the related exemplars. In this paper, we propose an algorithm which adaptively utilizes the related exemplars by cross-feature learning. Ordinal labels are used to represent the multiple relevance levels of the related videos. Label candidates of related exemplars are generated by exploring the possible relevance levels of each related exemplar via a cross-feature voting strategy. Maximum margin criterion is then applied in our framework to discriminate the positive and negative exemplars, as well as the related exemplars from different relevance levels. We test our algorithm using the large scale TRECVID 2011 dataset and it gains promising performance.
Ballas, N, Yang, Y, Lan, ZZ, Delezoide, B, Preteux, F & Hauptmann, A 2014, 'Space-time robust representation for action recognition', Proceedings of the IEEE International Conference on Computer Vision, IEEE International Conference on Computer Vision, IEEE, Sydney, NSW, Australia, pp. 2704-2711.View/Download from: Publisher's site
We address the problem of action recognition in unconstrained videos. We propose a novel content-driven pooling that leverages space-time context while being robust to global space-time transformations. Being robust to such transformations is of primary importance in unconstrained videos, where the action localization can drastically shift between frames. Our pooling identifies regions of interest using video structural cues estimated by different saliency functions. To combine the different structural information, we introduce an iterative structure learning algorithm, WSVM (weighted SVM), that determines the optimal saliency layout of an action model through a sparse regularizer. A new optimization method is proposed to solve WSVM's highly non-smooth objective function. We evaluate our approach on standard action datasets (KTH, UCF50 and HMDB). Most noticeably, the accuracy of our algorithm reaches 51.8% on the challenging HMDB dataset, which outperforms the state of the art by 7.3% relative. © 2013 IEEE.
Chang, Nie, Yang, Y & Huang 2014, 'A Convex Formulation for Semi-supervised Multi-Label Feature Selection', Proceedings of the Twenty-Eigth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Canada.
Gao, C, Yang, Y, Liu, G, Meng, D, Cai, Y, Xu, S, Tong, W, Shen, H & Hauptmann, AG 2014, 'Interactive surveillance event detection through mid-level discriminative representation', Proceedings of the ACM International Conference on Multimedia Retrieval 2014, ACM International Conference on Multimedia Retrieval, ACM, Scotland, pp. 305-312.View/Download from: Publisher's site
Event detection from real surveillance videos with complicated background environments is always a very hard task. Different from the traditional retrospective and interactive systems designed for this task, which are mainly executed on video fragments located within the event-occurrence time, in this paper we propose a new interactive system constructed on mid-level discriminative representations (patches/shots) which are closely related to the event (and might occur beyond the event-occurrence period) and are easier to detect than video fragments. By virtue of such easily-distinguished mid-level patterns, our framework realizes an effective division of labor between computers and human participants. The task of the computers is to train classifiers on a bunch of mid-level discriminative representations, and to sort all the possible mid-level representations in the evaluation sets based on the classifier scores. The task of the human participants is then to readily search for the events based on the clues offered by these sorted mid-level representations. For computers, such mid-level representations, with more concise and consistent patterns, can be detected more accurately than the video fragments utilized in the conventional framework; on the other hand, a human participant can always search much more easily for the events of interest implicated by these location-anchored mid-level representations than for conventional video fragments containing entire scenes. Both of these properties facilitate the applicability of our framework in real surveillance event detection applications. Copyright is held by the owner/author(s).
Jiang, L, Miao, Y, Yang, Y, Lan, Z & Hauptmann, AG 2014, 'Viral video style: A closer look at viral videos on YouTube', Proceedings of the ACM International Conference on Multimedia Retrieval 2014, ACM International Conference on Multimedia Retrieval, ACM, Scotland, pp. 193-200.View/Download from: Publisher's site
Viral videos that gain popularity through the process of Internet sharing are having a profound impact on society. Existing studies on viral videos have only been on small or confidential datasets. We collect by far the largest open benchmark for viral video study called CMU Viral Video Dataset, and share it with researchers from both academia and industry. Having verified existing observations on the dataset, we discover some interesting characteristics of viral videos. Based on our analysis, in the second half of the paper, we propose a model to forecast the future peak day of viral videos. The application of our work is not only important for advertising agencies to plan advertising campaigns and estimate costs, but also for companies to be able to quickly respond to rivals in viral marketing campaigns. The proposed method is unique in that it is the first attempt to incorporate video metadata into the peak day prediction. The empirical results demonstrate that the proposed method outperforms the state-of-the-art methods, with statistically significant differences. Copyright 2014 ACM.
Lan, ZZ, Yang, Y, Ballas, N, Yu, SI & Haputmann, A 2014, 'Resource constrained multimedia event detection', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), International Conference on Multimedia Modeling, pp. 388-399.View/Download from: Publisher's site
We present a study comparing the cost and efficiency trade-offs of multiple features for multimedia event detection. Low-level as well as semantic features are a critical part of contemporary multimedia and computer vision research. Arguably, combinations of multiple feature sets have been a major reason for recent progress in the field, not just as low-dimensional representations of multimedia data, but also as a means to semantically summarize images and videos. However, their efficacy for complex event recognition in unconstrained videos on standardized datasets has not been systematically studied. In this paper, we evaluate the accuracy and contribution of more than 10 multi-modality features, including semantic and low-level video representations, using two newly released NIST TRECVID Multimedia Event Detection (MED) open source datasets, i.e., MEDTEST and KINDREDTEST, which contain more than 1000 hours of videos. Contrasting multiple performance metrics, such as average precision, probability of missed detection and minimum normalized detection cost, we propose a framework to balance the trade-off between accuracy and computational cost. This study provides an empirical foundation for selecting feature sets that are capable of dealing with large-scale data with limited computational resources and are likely to produce superior multimedia event detection accuracy. This framework also applies to other resource-limited multimedia analyses such as selecting/fusing multiple classifiers and different representations of each feature set. © 2014 Springer International Publishing.
Li, Li, Wang, Yang, Y, Zhang & Zhou 2014, 'Overcoming Semantic Drift in Information Extraction', Proc. 17th International Conference on Extending Database Technology (EDBT), Extending Database Technology, ACM, Greece.View/Download from: Publisher's site
Ma, Yang, Y, Sebe & Hauptmann 2014, 'Multiple Features But Few Labels? A Symbiotic Solution Exemplified for Video Analysis', MM '14 Proceedings of the 22nd ACM international conference on Multimedia, ACM International Conference on Multimedia, ACM, Orlando, FL, USA.View/Download from: Publisher's site
Peng, Meng, Xu, Gao, Yang, Y & Zhang 2014, 'Decomposable Nonlocal Tensor Dictionary Learning for Multispectral Image Denoising', 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, Ohio, USA.View/Download from: Publisher's site
As compared to conventional RGB or gray-scale images, multispectral images (MSI) can deliver more faithful representations of real scenes and enhance the performance of many computer vision tasks. In practice, however, an MSI is always corrupted by various noises. In this paper we propose an effective MSI denoising approach by jointly considering two intrinsic characteristics underlying an MSI: the nonlocal similarity over space and the global correlation across spectrum. Specifically, by explicitly considering the spatial self-similarity of an MSI we construct a nonlocal tensor dictionary learning model with a group-block-sparsity constraint, which makes similar full-band patches (FBP) share the same atoms from the spatial and spectral dictionaries. Furthermore, by exploiting the spectral correlation of an MSI and assuming over-redundancy of dictionaries, the constrained nonlocal MSI dictionary learning model can be decomposed into a series of unconstrained low-rank tensor approximation problems, which can be readily solved by off-the-shelf higher-order statistics. Experimental results show that our method outperforms all state-of-the-art MSI denoising methods under comprehensive quantitative performance measures.
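The kind of low-rank tensor approximation those subproblems reduce to can be illustrated with a truncated HOSVD. The sketch below is a generic NumPy implementation applied to a toy spatial-spatial-spectral block; the block size and the per-mode ranks are assumptions, and the paper's full pipeline (FBP grouping, dictionary constraints) is not shown.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n unfolding: move the chosen axis first, flatten the rest.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_mul(T, M, mode):
    # Mode-n product of tensor T with matrix M.
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def truncated_hosvd(T, ranks):
    # Per-mode SVD of each unfolding gives the factor matrices; projecting
    # onto the leading singular vectors yields a low-multilinear-rank approximation.
    Us = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        Us.append(U[:, :r])
    core = T
    for mode, U in enumerate(Us):
        core = mode_mul(core, U.T, mode)
    approx = core
    for mode, U in enumerate(Us):
        approx = mode_mul(approx, U, mode)
    return approx

patches = np.random.default_rng(0).normal(size=(8, 8, 31))  # spatial x spatial x spectral
denoised = truncated_hosvd(patches, ranks=(4, 4, 5))
print(denoised.shape)  # (8, 8, 31)
```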
Xu, Ye, Li, Liu, Yang, Y & Ding 2014, 'Dynamic Background Learning through Deep Auto-encoder Networks', MM '14 Proceedings of the 22nd ACM international conference on Multimedia, ACM International Conference on Multimedia, ACM, Orlando, FL, USA.View/Download from: Publisher's site
Xu, Z, Yang, Y, Kassim, A & Yan, S 2014, 'Cross-media relevance mining for evaluating text-based image search engine', 2014 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2014, IEEE International Conference on Multimedia and Expo Workshops, IEEE, Chengdu, China.View/Download from: Publisher's site
© 2014 IEEE. Targeted at the MSR-Bing Image Retrieval grand challenge, we build on our experience from the 2013 challenge and make further investigations into different models for solving the relevance assessment problem. Generally speaking, assessing the relevance between a text query and an image list is very hard due to the semantic gap. It is not easy to find the 'mapping' from a text query into the visual world. Solutions from the 2013 MSR-Bing grand challenge are discussed in this paper. Combining them with our own observations, we give some insights into why some solutions work well while others do not. Our main solution is to combine deep learning features with the winning solution of last year (average similarity measurement and PageRank), since deep learning features have a more compact representation than the traditional BoW features and are efficient (on a decent GPU) with very good performance. Our solution achieved 1st place in the MSR-Bing grand challenge 2014. Finally, we give the running time of our solution in the testing phase for the 2014 ICME testing set and development set, respectively.
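A minimal sketch of this reranking flavor, under assumptions: random vectors stand in for the deep features of a returned image list, and relevance combines each image's average similarity to the rest of the list with a PageRank-style random walk on the similarity graph. The 50/50 score blend and the damping factor are illustrative choices, not the competition system's settings.

```python
import numpy as np

def rerank(features, damping=0.85, iters=50):
    # Images visually similar to many others in the returned list rise to the top.
    F = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    S = np.clip(F @ F.T, 0, None)                # cosine similarity graph
    np.fill_diagonal(S, 0)
    avg_sim = S.mean(1)                          # average-similarity score
    P = S / (S.sum(1, keepdims=True) + 1e-12)    # row-stochastic transitions
    r = np.full(len(S), 1.0 / len(S))
    for _ in range(iters):                       # PageRank power iteration
        r = (1 - damping) / len(S) + damping * (P.T @ r)
    score = 0.5 * avg_sim / (avg_sim.max() + 1e-12) + 0.5 * r / (r.max() + 1e-12)
    return np.argsort(-score)

feats = np.random.default_rng(0).normal(size=(20, 128))  # 20 candidate images
print(rerank(feats)[:5])  # indices of the five most relevant images
```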
Yang, Y, Ma, Z, Xu, Z, Yan, S & Hauptmann, AG 2014, 'How related exemplars help complex event detection in web videos?', Proceedings of the IEEE International Conference on Computer Vision, IEEE International Conference on Computer Vision, IEEE, Sydney, NSW, Australia, pp. 2104-2111.View/Download from: Publisher's site
Compared to visual concepts such as actions, scenes and objects, a complex event is a higher-level abstraction of longer video sequences. For example, a 'marriage proposal' event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). Positive exemplars which exactly convey the precise semantics of an event are hard to obtain. It would be beneficial to utilize related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially, as relatedness assessment is subjective. Two related exemplars can be about completely different events; e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to the 'attempting a bike trick' event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm achieves good performance for complex event detection. © 2013 IEEE.
Yang, Y, Shen, Yu, Meng & Hauptmann 2014, 'Unsupervised Video Adaptation for Parsing Human Motion', Lecture Notes in Computer Science - Computer Vision – ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, European Conference on Computer Vision, Springer, Switzerland.View/Download from: Publisher's site
Yu, Z, Wu, F, Yang, Y, Tian, Q, Luo, J & Zhuang, Y 2014, 'Discriminative coupled dictionary hashing for fast cross-media retrieval', SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM International Conference on Research and Development in Information Retrieval, ACM, Australia, pp. 395-404.View/Download from: Publisher's site
Cross-media hashing, which conducts cross-media retrieval by embedding data from different modalities into a common low-dimensional Hamming space, has attracted intensive attention in recent years. Existing cross-media hashing approaches only aim at learning hash functions to preserve the intra-modality and inter-modality correlations, but do not directly capture the underlying semantic information of the multi-modal data. We propose a discriminative coupled dictionary hashing (DCDH) method in this paper. In DCDH, the coupled dictionary for each modality is learned with side information (e.g., categories). As a result, the coupled dictionaries not only preserve the intra-similarity and inter-correlation among multi-modal data, but also contain dictionary atoms that are semantically discriminative (i.e., data from the same category are reconstructed by similar dictionary atoms). To perform fast cross-media retrieval, we learn hash functions which map data from the dictionary space to a low-dimensional Hamming space. Besides, we conjecture that a balanced representation is crucial in cross-media retrieval. We introduce multi-view features on the relatively "weak" modalities into DCDH and extend it to multi-view DCDH (MV-DCDH) in order to enhance their representation capability. The experiments on two real-world data sets show that our DCDH and MV-DCDH outperform the state-of-the-art methods significantly on cross-media retrieval. Copyright 2014 ACM.
Yu, Z, Zhang, Y, Tang, S, Yang, Y, Tian, Q & Luo, J 2014, 'CROSS-MEDIA HASHING WITH KERNEL REGRESSION', 2014 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), IEEE International Conference on Multimedia and Expo Workshops (ICMEW), IEEE, Chengdu, PEOPLES R CHINA.
Zhang, L, Yang, Y & Zimmermann, R 2014, 'Discriminative cellets discovery for fine-grained image categories retrieval', Proceedings of the ACM International Conference on Multimedia Retrieval 2014, ACM International Conference on Multimedia Retrieval, ACM, Scotland, pp. 57-64.View/Download from: Publisher's site
Fine-grained image category recognition is a challenging task aiming at distinguishing objects belonging to the same basic-level category, such as leaf or mushroom. It is a useful technique that can be applied for species recognition, face verification, etc. Most of the existing methods have difficulty automatically detecting discriminative object components. In this paper, we propose a new fine-grained image categorization model that can be deemed an improved version of spatial pyramid matching (SPM). Instead of the conventional SPM that enumeratively conducts cell-to-cell matching between images, the proposed model combines multiple cells into cellets that are highly responsive to object fine-grained categories. In particular, we describe object components by cellets that connect spatially adjacent cells from the same pyramid level. Straightforwardly, image categorization can be cast as the matching between cellets extracted from pairwise images. Toward an effective matching process, a hierarchical sparse coding algorithm is derived that represents each cellet by a linear combination of the basis cellets. Further, a linear discriminant analysis (LDA)-like scheme is employed to select the cellets with high discrimination. On the basis of the feature vector built from the selected cellets, fine-grained image categorization is conducted by training a linear SVM. Experimental results on the Caltech-UCSD birds, the Leeds butterflies, and the COSMIC insects data sets demonstrate that our model outperforms the state of the art. Besides, the visualized cellets show that discriminative object parts are localized accurately. Copyright 2014 ACM.
Xu, Z, Yang, Y, Tsang, I, Hauptmann, A & Sebe, N 2013, 'Feature Weighting via Optimal Thresholding for Video Analysis', Proceedings of the 2013 IEEE International Conference on Computer Vision, IEEE International Conference on Computer Vision, IEEE, Sydney, NSW, Australia, pp. 3340-3447.View/Download from: Publisher's site
Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks like the TRECVID Multimedia Event Detection (MED) competition. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT), to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of the fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on the HMDB action recognition dataset and the CCV video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.
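To make the flavor of the approach concrete, the simplified sketch below passes each feature's decision values through a smooth (sigmoid) threshold before a weighted sum. In FWOT the weights, thresholds and smoothing are learned jointly; here they are hand-supplied assumptions purely for illustration.

```python
import numpy as np

def fuse(decision_values, weights, thresholds, smooth=10.0):
    # Soft-threshold each feature's decision values, then combine with weights.
    fused = np.zeros(decision_values.shape[1])
    for dv, w, t in zip(decision_values, weights, thresholds):
        fused += w / (1.0 + np.exp(-smooth * (dv - t)))  # sigmoid thresholding
    return fused

rng = np.random.default_rng(0)
dv = rng.normal(size=(3, 100))  # 3 features x 100 test videos (toy decision values)
scores = fuse(dv, weights=[0.5, 0.3, 0.2], thresholds=[0.0, 0.1, -0.1])
print(scores[:5])               # fused detection scores
```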
Cai, Y, Yang, Y, Hauptmann, AG & Wactlar, HD 2013, 'A cognitive assistive system for monitoring the use of home medical devices', MIIRH 2013 - Proceedings of the 1st ACM International Workshop on Multimedia Indexing and Information Retrieval for Heathcare, Co-located with ACM Multimedia 2013, ACM International Conference on Multimedia, ACM, Barcelona, Spain, pp. 59-66.View/Download from: Publisher's site
Despite the popularity of home medical devices, serious safety concerns have been raised, because use-errors of home medical devices have been linked to a large number of fatal hazards. To address this problem, we introduce a cognitive assistive system to automatically monitor the use of home medical devices. Being able to accurately recognize user operations is one of the most important functionalities of the proposed system. However, even though various action recognition algorithms have been proposed in recent years, it is still unknown whether they are adequate for recognizing operations in using home medical devices. Since the lack of a corresponding database is the main reason for this situation, in the first part of this paper we present a database specially designed for studying the use of home medical devices. Then, we evaluate the performance of existing approaches on the proposed database. Although we use state-of-the-art approaches that have demonstrated near-perfect performance in recognizing certain general human actions, we observe a significant performance drop when applying them to recognize device operations. We conclude that the tiny actions involved in using devices are one of the most important reasons for the performance decrease. To accurately recognize tiny actions, it is critical to focus on where the target action happens, namely the region of interest (ROI), and to build more elaborate action modeling based on the ROI. Therefore, in the second part of this paper, we introduce a simple but effective approach to estimating the ROI for recognizing tiny actions. The key idea of this method is to analyze the correlation between an action and the sub-regions of a frame. The estimated ROI is then used as a filter for building more accurate action representations. Experimental results show significant performance improvements over the baseline methods when using the estimated ROI for action recognition. © 2013 ACM.
Cao, X, Wei, X, Han, Y, Yang, Y & Lin, D 2013, 'Robust tensor clustering with non-greedy maximization', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press, Beijing China, pp. 1254-1259.
Tensors are increasingly common in several areas such as data mining, computer graphics, and computer vision. Tensor clustering is a fundamental tool for data analysis and pattern discovery. However, there usually exist outlying data points in real-world datasets, which reduce the performance of clustering. This motivates us to develop a tensor clustering algorithm that is robust to outliers. In this paper, we propose a Robust Tensor Clustering (RTC) algorithm. RTC first finds a lower-rank approximation of the original tensor data using an L1-norm optimization function. Because the L1 norm does not exaggerate the effect of outliers compared with the L2 norm, the minimization of the L1-norm approximation function makes RTC robust to outliers. Then we compute the HOSVD decomposition of this approximate tensor to obtain the final clustering results. Different from the traditional algorithms that solve the approximation function with a greedy strategy, we utilize a non-greedy strategy to obtain a better solution. Experiments demonstrate that RTC performs better than state-of-the-art algorithms and is more robust to outliers.
Chen, MY, Hauptmann, A, Bharucha, A, Wactlar, H & Yang, Y 2011, 'Human activity analysis for geriatric care in nursing homes', The Era of Interactive Media, Pacific-Rim Conference on Multimedia, Springer, Sydney, Australia, pp. 53-61.View/Download from: Publisher's site
© 2013 Springer Science+Business Media, LLC. All rights reserved. As our society is increasingly aging, it is urgent to develop computer-aided techniques to improve the quality-of-care (QoC) and quality-of-life (QoL) of geriatric patients. In this paper, we focus on automatic human activity analysis in surveillance video recorded in complicated environments at a nursing home. This will enable the automatic exploration of statistical patterns between patients' daily activities and their clinical diagnoses. We also discuss potential future research directions in this area. Experiments demonstrate that the proposed approach is effective for human activity analysis.
Han, Yang, Y & Zhou 2013, 'Co-Regularized Ensemble for Feature Selection', Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 13), International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, Beijing, China.
Ma, Yang, Y, Nie & Sebe 2013, 'Thinking of Images as What They Are: Compound Matrix Regression for Image Classification', Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 13), International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, Beijing, China.
Ma, Yang, Y, Xu, Sebe & Hauptmann 2013, 'We Are Not Equally Negative: Fine-grained Labeling for Multimedia Event Detection', MM '13 Proceedings of the 21st ACM international conference on Multimedia, ACM International Conference on Multimedia, ACM, Barcelona, Spain.View/Download from: Publisher's site
Ma, Yang, Y, Xu, Sebe, Yan & Hauptmann 2013, 'Complex Event Detection via Multi-Source Video Attributes', 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Portland, OR, USA.View/Download from: Publisher's site
Song, J, Yang, Y, Yang, Y, Huang, Z & Shen, HT 2013, 'Inter-media hashing for large-scale retrieval from heterogeneous data sources', Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Special Interest Group on Management of Data Conference, ACM, New York, New York, USA, pp. 785-796.View/Download from: Publisher's site
In this paper, we present a new multimedia retrieval paradigm to innovate large-scale search of heterogeneous multimedia data. It is able to return results of different media types from heterogeneous data sources, e.g., using a query image to retrieve relevant text documents or images from different data sources. This utilizes the widely available data from different sources and caters to current users' demand for a result list that simultaneously contains multiple types of data, giving a comprehensive understanding of the query's results. To enable large-scale inter-media retrieval, we propose a novel inter-media hashing (IMH) model to explore the correlations among multiple media types from different data sources and tackle the scalability issue. To this end, multimedia data from heterogeneous data sources are transformed into a common Hamming space, in which fast search can be easily implemented by XOR and bit-count operations. Furthermore, we integrate a linear regression model to learn hashing functions so that the hash codes for new data points can be efficiently generated. Experiments conducted on real-world large-scale multimedia datasets demonstrate the superiority of our proposed method compared with state-of-the-art techniques. Copyright © 2013 ACM.
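The search primitive named in the abstract, XOR plus bit-count in a common Hamming space, is easy to sketch. Below, a random projection merely stands in for the learned inter-media hash functions, and a byte-wise popcount lookup table implements the bit counting; only the Hamming-search mechanics, not the IMH learning, are shown.

```python
import numpy as np

def to_codes(X, W):
    # Binarize projected features into compact codes (sign thresholding);
    # W stands in for learned hash functions. 8 bits are packed per byte.
    bits = (X @ W > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

# Lookup table: number of set bits for every possible byte value 0..255.
POPCNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)

def hamming_search(query_code, db_codes, k=5):
    # XOR the codes, then count set bits per byte via the lookup table.
    dists = POPCNT[np.bitwise_xor(db_codes, query_code)].sum(1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64))                    # 64-bit codes
db = to_codes(rng.normal(size=(10000, 128)), W)   # 10k database items
q = to_codes(rng.normal(size=(1, 128)), W)[0]
print(hamming_search(q, db))                      # top-5 nearest codes
```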
Wang, S, Xu, Z, Yang, Y, Li, X, Pang, C & Haumptmann, AG 2013, 'Fall detection in multi-camera surveillance videos: Experimentations and observations', MIIRH 2013 - Proceedings of the 1st ACM International Workshop on Multimedia Indexing and Information Retrieval for Heathcare, Co-located with ACM Multimedia 2013, pp. 33-38.View/Download from: Publisher's site
This paper presents our study on fall detection for ageing-care monitoring. We collected a choreographed multi-camera dataset that contains fall actions and other actions such as walking, standing up, sitting down and so forth. In our work, the MoSIFT feature is extracted from the videos recorded by each camera. We conduct a series of experiments to show how the performance of fall detection varies when different methods are used. We first compare the performance of the standard Bag-of-Words and spatial Bag-of-Words with different codebook sizes. Then, we test different fusion methods which combine the information from the videos recorded by two orthogonally deployed cameras, where a non-linear χ2-kernel Support Vector Machine (SVM) is trained to detect fall actions. In addition, we also use explicit feature maps along with a linear kernel for fall detection and compare this to the standard Bag-of-Words representation with a non-linear χ2 kernel. Our experimental results show that late fusion of Bag-of-Words with a 1000-center codebook obtains the best performance. The best result reaches 90.46% in average precision, which in turn may support a more independent and safer living environment for the elderly. © 2013 ACM.
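For reference, the sketch below computes the exponential chi-square kernel commonly paired with such bag-of-words histograms; the bandwidth gamma and the toy histograms are assumptions, and the resulting Gram matrix could be fed to an SVM that accepts a precomputed kernel.

```python
import numpy as np

def chi2_kernel(X, Y, gamma=1.0):
    # k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)),
    # computed between every row of X and every row of Y.
    X, Y = X[:, None, :], Y[None, :, :]
    num = (X - Y) ** 2
    den = X + Y + 1e-12
    return np.exp(-gamma * (num / den).sum(-1))

rng = np.random.default_rng(0)
train = rng.random((6, 1000))          # 6 clips x 1000-word codebook histograms
train /= train.sum(1, keepdims=True)   # L1-normalize the histograms
K = chi2_kernel(train, train)          # precomputed kernel matrix for an SVM
print(K.shape, K[0, 0])                # (6, 6) 1.0
```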
Wu, F, Tan, X, Yang, Y, Tao, D, Tang, S & Zhuang, Y 2013, 'Supervised Nonnegative Tensor Factorization with Maximum-Margin Constraint', Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Bellevue, Washington, USA, pp. 962-968.
Non-negative tensor factorization (NTF) has attracted great attention in the machine learning community. In this paper, we extend traditional non-negative tensor factorization into a supervised discriminative decomposition, referred to as Supervised Non-negative Tensor Factorization with Maximum-Margin Constraint (SNTFM2). SNTFM2 formulates the optimal discriminative factorization of non-negative tensorial data as a coupled least-squares optimization problem via a maximum-margin method. As a result, SNTFM2 not only faithfully approximates the tensorial data by additive combinations of the basis, but also obtains strong generalization power for discriminative analysis (in particular, for classification in this paper). The experimental results show the superiority of our proposed model over state-of-the-art techniques on both toy and real-world data sets.
Yu, Yang, Y & Hauptmann 2013, 'Harry Potter's Marauder's Map: Localizing and Tracking Multiple Persons-of-Interest by Nonnegative Discretization', 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Portland, USA.View/Download from: Publisher's site
Zheng, Shang, Yuan & Yang, Y 2013, 'Towards Efficient Search for Activity Trajectories', 2013 IEEE 29TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), IEEE International Conference on Data Engineering, IEEE, Brisbane, QLD, Australia.View/Download from: Publisher's site
The advances in location positioning and wireless communication technologies have led to a myriad of spatial trajectories representing the mobility of a variety of moving objects. While processing trajectory data with the focus of spatio-temporal features has been widely studied in the last decade, recent proliferation in location-based web applications (e.g., Foursquare, Facebook) has given rise to large amounts of trajectories associated with activity information, called activity trajectory. In this paper, we study the problem of efficient similarity search on activity trajectory database. Given a sequence of query locations, each associated with a set of desired activities, an activity trajectory similarity query (ATSQ) returns k trajectories that cover the query activities and yield the shortest minimum match distance. An order-sensitive activity trajectory similarity query (OATSQ) is also proposed to take into account the order of the query locations. To process the queries efficiently, we firstly develop a novel hybrid grid index, GAT, to organize the trajectory segments and activities hierarchically, which enables us to prune the search space by location proximity and activity containment simultaneously. In addition, we propose algorithms for efficient computation of the minimum match distance and minimum order-sensitive match distance, respectively. The results of our extensive empirical studies based on real online check-in datasets demonstrate that our proposed index and methods are capable of achieving superior performance and good scalability.
Cao, L, Ji, R, Gao, Y, Yang, Y & Tian, Q 2012, 'Weakly supervised sparse coding with geometric consistency pooling', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Providence, RI, USA, pp. 3578-3585.View/Download from: Publisher's site
Most recently the Bag-of-Features (BoF) representation has been well advocated for image search and classification, with two decent phases named sparse coding and max pooling to compensate for quantization loss as well as inject spatial layouts. But still, much information is discarded by quantizing local descriptors with two-dimensional layouts into a one-dimensional BoF histogram. In this paper, we revisit this popular sparse coding and max pooling paradigm by looking at the local descriptor context towards an optimal BoF. First, we introduce Weakly supervised Sparse Coding (WSC) to exploit Classemes-based attribute labeling to refine the descriptor coding procedure. It is achieved by learning an attribute-to-word co-occurrence prior to impose a label inconsistency distortion over the ℓ1-based coding regularizer, such that the descriptor codes can maximally preserve the image semantic similarity. Second, we propose an adaptive feature pooling scheme over superpixels rather than over fixed spatial pyramids, named Geometric Consistency Pooling (GCP). As an effect, local descriptors enjoying good geometric consistency are pooled together to ensure a more precise spatial layout embedding in BoF. Both of our phases are unsupervised, which differs from the existing works in supervised dictionary learning, sparse coding and feature pooling. Therefore, our approach enables potential applications like scalable visual search. We evaluate on both image classification and search benchmarks and report good improvements over the state of the art. © 2012 IEEE.
Li, Z, Yang, Y, Liu, J, Zhou, X & Lu, H 2012, 'Unsupervised feature selection using nonnegative spectral analysis', Proceedings of the National Conference on Artificial Intelligence, pp. 1026-1032.
In this paper, a new unsupervised learning algorithm, namely Nonnegative Discriminative Feature Selection (NDFS), is proposed. To exploit discriminative information in unsupervised scenarios, we perform spectral clustering to learn the cluster labels of the input samples, during which feature selection is performed simultaneously. The joint learning of the cluster labels and the feature selection matrix enables NDFS to select the most discriminative features. To learn more accurate cluster labels, a nonnegative constraint is explicitly imposed on the class indicators. To reduce redundant or even noisy features, an ℓ2,1-norm minimization constraint is added to the objective function, which guarantees that the feature selection matrix is sparse in rows. Our algorithm exploits the discriminative information and feature correlation simultaneously to select a better feature subset. A simple yet efficient iterative algorithm is designed to optimize the proposed objective function. Experimental results on different real-world datasets demonstrate the encouraging performance of our algorithm over the state of the art. Copyright © 2012, Association for the Advancement of Artificial Intelligence. All rights reserved.
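The row-sparsity mechanism can be sketched in isolation. Assuming a given (pseudo-)label matrix in place of the nonnegative cluster indicators that NDFS learns jointly, the snippet below solves an ℓ2,1-regularized regression by iteratively reweighted least squares and ranks the features by the row norms of the resulting weight matrix.

```python
import numpy as np

def l21_feature_scores(X, Y, lam=0.1, iters=30):
    # Iteratively reweighted least squares for
    #   min_W ||X W - Y||_F^2 + lam * ||W||_{2,1}.
    # The l2,1 penalty drives whole rows of W to zero, so row norms rank features.
    d = X.shape[1]
    D = np.eye(d)
    for _ in range(iters):
        W = np.linalg.solve(X.T @ X + lam * D, X.T @ Y)
        row_norms = np.linalg.norm(W, axis=1)
        D = np.diag(1.0 / (2.0 * row_norms + 1e-12))
    return np.linalg.norm(W, axis=1)  # larger norm = more discriminative feature

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))              # 100 samples x 20 features
Y = np.eye(3)[rng.integers(0, 3, 100)]      # one-hot pseudo cluster labels
print(np.argsort(-l21_feature_scores(X, Y))[:5])  # top-5 selected features
```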
Ma, Z, Yang, Y, Cai, Y, Sebe, N & Hauptmann, AG 2012, 'Knowledge adaptation for ad hoc multimedia event detection with few exemplars', MM 2012 - Proceedings of the 20th ACM International Conference on Multimedia, ACM International Conference on Multimedia, IEEE, Institute of Electrical and Electronics Engineers, Nara, Japan, pp. 469-478.View/Download from: Publisher's site
Multimedia event detection (MED) has a significant impact on many applications. Though video concept annotation has received much research effort, video event detection remains largely unaddressed. Current research mainly focuses on sports and news event detection or abnormality detection in surveillance videos. Our research on this topic is capable of detecting more complicated and generic events. Moreover, the curse of reality, i.e., that precisely labeled multimedia content is scarce, necessitates studying how to attain respectable detection performance using only limited positive examples. Research addressing these two aforementioned issues is still in its infancy. In light of this, we explore Ad Hoc MED, which aims to detect complicated and generic events using few positive examples. To the best of our knowledge, our work makes the first attempt on this topic. As the information from these few positive examples is limited, we propose to infer knowledge from other multimedia resources to facilitate event detection. Experiments are performed on real-world multimedia archives consisting of several challenging events. The results show that our approach outperforms several other detection algorithms. Most notably, our algorithm outperforms SVM by 43% and 14% in Average Precision when using the Gaussian and χ2 kernels, respectively. © 2012 ACM.
Ma, Z, Yang, Y, Hauptmann, AG & Sebe, N 2012, 'Classifier-specific intermediate representation for multimedia tasks', Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ICMR 2012, ACM International Conference on Multimedia Retrieval, Association for Computing Machinery (ACM), Hong Kong, Hong Kong.View/Download from: Publisher's site
Video annotation and multimedia classification play important roles in many applications such as video indexing and retrieval. To improve video annotation and event detection, researchers have proposed using intermediate concept classifiers with concept lexica to help understand the videos. Yet it is difficult to judge how many and what concepts would be sufficient for the particular video analysis task. Additionally, obtaining robust semantic concept classifiers requires a large number of positive training examples, which in turn has high human annotation cost. In this paper, we propose an approach that is able to automatically learn an intermediate representation from video features together with a classifier. The joint optimization of the two components makes them mutually beneficial and reciprocal. Effectively, the intermediate representation and the classifier are tightly correlated. The classifier dependent intermediate representation not only accurately reflects the task semantics but is also more suitable for the specific classifier. Thus we have created a discriminative semantic analysis framework based on a tightly-coupled intermediate representation. Several experiments on video annotation and multimedia event detection using real-world videos demonstrate the effectiveness of the proposed approach. Copyright © 2012 ACM.
Wang, S, Yang, Y, Ma, Z, Li, X, Pang, C & Hauptmann, AG 2012, 'Action recognition by exploring data distribution and feature correlation', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Providence, RI, USA, pp. 1370-1377.View/Download from: Publisher's site
Human action recognition in videos draws strong research interest in computer vision because of its promising applications for video surveillance, video annotation, interactive gaming, etc. However, the amount of video data containing human actions is increasing exponentially, which makes the management of these resources a challenging task. Given a database with huge volumes of unlabeled videos, it is prohibitive to manually assign specific action types to these videos. Considering that it is much easier to obtain a small number of labeled videos, a practical solution for organizing them is to build a mechanism which is able to conduct action annotation automatically by leveraging the limited labeled videos. Motivated by this intuition, we propose an automatic video annotation algorithm by integrating semi-supervised learning and shared structure analysis into a joint framework for human action recognition. We apply our algorithm to both synthetic and realistic video datasets, including KTH, the CareMedia dataset, YouTube Action and its extended version, and UCF50. Extensive experiments demonstrate that the proposed algorithm outperforms the compared algorithms for action recognition. Most notably, our method has a very distinct advantage over the other compared algorithms when only a few labeled samples are available. © 2012 IEEE.
Yang, Y & Bao, C 2012, 'A novel discriminant locality preserving projections for MDM-based speaker classification', 3rd Global Congress on Intelligent Systems (GCIS) 2012, WRI Global Congress on Intelligent Systems (GCIS), IEEE, Wuhan, China, pp. 127-130.View/Download from: Publisher's site
Speaker classification is an important component of audio indexing technology for many applications such as multimedia conferencing. The primary input device in the NIST speaker classification evaluation is Multiple Distant Microphones (MDM). MDM is composed of multiple microphones and has the merits of low price and ease of use. The spatial time-delay vector of MDM can be extracted as the speaker's discriminant feature. However, the feature dimension expands quickly with an increasing number of sensors. Locality Preserving Projections (LPP) and Discriminant Locality Preserving Projections (DLPP) are the principal manifold dimensionality-reduction algorithms proposed recently. In this paper, we propose a novel method to overcome the drawbacks of traditional manifold algorithms, such as the lack of class information or spatial identification information. Some basic concepts of the spatial time-delay feature and the merged feature for MDM speaker classification are first introduced. A review of the known DLPP algorithm, followed by the Fisher criterion, is given. Then the Multi-component Discriminant Locality Preserving Projections (MDLPP) method for speaker classification with MDM is described. Comparative experimental results on real meeting data show the effectiveness of the proposed method.
Yang, Y, Hauptmann, A, Chen, MY, Cai, Y, Bharucha, A & Wactlar, H 2012, 'Learning to predict health status of geriatric patients from observational data', 2012 IEEE Symposium on Computational Intelligence and Computational Biology, CIBCB 2012, IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, IEEE, Institute of Electrical and Electronics Engineers, San Diego, CA, USA, pp. 127-134.View/Download from: Publisher's site
Data for diagnosis and clinical studies are still typically gathered by hand. While more detailed, exhaustive behavioral assessment scales have been developed, they have the drawback of being time consuming, and manual assessment can be subjective. In addition, accurate manual assessment requires clinical knowledge, for which extensive training is needed. Our research challenge is therefore to leverage machine learning techniques to automatically understand patients' health status based on continuous computer observations. In this paper, we study the problem of health status prediction for geriatric patients using observational data. In the first part of this paper, we propose a distance metric learning algorithm to learn a Mahalanobis distance that is more precise for similarity measures. In the second part, we propose a robust classifier based on ℓ2,1-norm regression to predict the geriatric patients' health status. We test the algorithm on a dataset collected from a nursing home. Experiments show that our algorithm achieves encouraging performance. © 2012 IEEE.
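A minimal sketch of the Mahalanobis distance that such a metric learning step produces may help make the first part concrete. The matrix `M` here is a stand-in for the learned positive semi-definite metric, not the paper's actual output; the inverse covariance used below merely gives one valid example of such a metric.

```python
import numpy as np

def mahalanobis(x, y, M):
    """Distance between feature vectors x and y under a PSD metric M."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

# Toy observational features; the inverse covariance is one valid PSD metric.
X = np.random.randn(50, 4)
M = np.linalg.inv(np.cov(X, rowvar=False))
print(mahalanobis(X[0], X[1], M))
```

A learned `M` generalizes the Euclidean case (`M` = identity) by stretching directions that matter for similarity and shrinking those that do not.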
Yang, Y, Yang, Y, Huang, Z, Liu, J & Ma, Z 2012, 'Robust cross-media transfer for visual event detection', MM 2012 - Proceedings of the 20th ACM International Conference on Multimedia, ACM International Conference on Multimedia, Association for Computing Machinery (ACM), Nara, Japan, pp. 1045-1048.View/Download from: Publisher's site
In this paper, we present a novel approach, named Robust Cross-Media Transfer (RCMT), for visual event detection in social multimedia environments. Different from most existing methods, the proposed method can directly take different types of noisy social multimedia data as input and conduct robust event detection. More specifically, we build a robust model by employing an ℓ2,1-norm regression model featuring noise tolerance, and also manage to integrate different types of social multimedia data by minimizing the distribution difference among them. Experimental results on a real-life Flickr image dataset and a YouTube video dataset demonstrate the effectiveness of our proposal compared to state-of-the-art algorithms. © 2012 ACM.
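The noise tolerance of an ℓ2,1-norm regression comes from summing, rather than squaring, the per-sample residual norms, so outliers contribute only linearly. Below is a minimal iteratively reweighted least-squares sketch of such a model; the ridge term `gamma` is an assumption for illustration, and this is not the exact RCMT objective.

```python
import numpy as np

def l21_regression(X, Y, gamma=1.0, iters=30, eps=1e-8):
    """Minimize ||XW - Y||_{2,1} + gamma * ||W||_F^2 by reweighted least squares."""
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    for _ in range(iters):
        R = X @ W - Y
        # One weight per sample: large residuals are down-weighted,
        # which is where the noise tolerance comes from.
        w = 1.0 / (2.0 * np.linalg.norm(R, axis=1) + eps)
        XtD = X.T * w                      # X^T D with D = diag(w)
        W = np.linalg.solve(XtD @ X + gamma * np.eye(d), XtD @ Y)
    return W
```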
Yu, SI, Xu, Z, Ding, D, Sze, W, Vicente, F, Lan, Z, Cai, Y, Rawat, S, Schulam, P, Markandaiah, N, Bahmani, S, Juarez, A, Tong, W, Yang, Y, Burger, S, Metze, F, Singh, R, Raj, B, Stern, R, Mitamura, T, Nyberg, E & Hauptmann, A 2012, 'Informedia e-lamp @ TRECVID 2012 multimedia event detection and recounting MED and MER', 2012 TREC Video Retrieval Evaluation Notebook Papers.
We report on our system used in the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. For MED, the system consists of three main steps: extracting features, training detectors, and fusion. In the feature extraction part, we extract many low-level, high-level, and text features. Those features are then represented in three different ways: spatial bag-of-words with standard tiling, spatial bag-of-words with feature- and event-specific tiling, and the Gaussian Mixture Model Super Vector. In detector training and fusion, two classifiers and three fusion methods are employed. The results from both the official sources and our internal evaluations show the good performance of our system. Our MER system utilizes a subset of features and detection results from the MED system, from which the recounting is generated.
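As a concrete illustration of the fusion step, below is a minimal weighted late-fusion sketch over per-detector scores. The min-max normalization and default equal weights are assumptions for illustration, not the submission's tuned settings.

```python
import numpy as np

def late_fusion(scores, weights=None):
    """Combine scores from several detectors (rows: videos, cols: detectors)."""
    S = np.asarray(scores, dtype=float)
    # Min-max normalize each detector so score scales are comparable.
    S = (S - S.min(axis=0)) / (S.max(axis=0) - S.min(axis=0) + 1e-12)
    w = np.ones(S.shape[1]) if weights is None else np.asarray(weights, dtype=float)
    return S @ (w / w.sum())               # one fused score per video
```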
Ma, Z, Yang, Y, Nie, F, Uijlings, J & Sebe, N 2011, 'Exploiting the entire feature space with sparsity for automatic image annotation', MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, ACM International Conference on Multimedia, ACM, Scottsdale, Arizona, USA, pp. 283-292.View/Download from: Publisher's site
The explosive growth of digital images requires effective methods to manage these images. Among various existing methods, automatic image annotation has proved to be an important technique for image management tasks, e.g., image retrieval over large-scale image databases. Automatic image annotation has been widely studied in recent years and a considerable number of approaches have been proposed. However, the performance of these methods is not yet satisfactory, demanding further research on image annotation. In this paper, we propose a novel semi-supervised framework built upon feature selection for automatic image annotation. Our method aims to jointly select the most relevant features from all the data points by using a sparsity-based model and exploiting both labeled and unlabeled data to learn the manifold structure. Our framework is able to simultaneously learn a robust classifier for image annotation by selecting the discriminating features related to the semantic concepts. To solve the objective function of our framework, we propose an efficient iterative algorithm. Extensive experiments are performed on different real-world image datasets with the results demonstrating the promising performance of our framework for automatic image annotation. © 2011 ACM.
Shen, HT, Shao, J, Huang, Z, Yang, Y, Song, J, Liu, J & Zhu, X 2011, 'UQMSG experiments for TRECVID 2011', 2011 TREC Video Retrieval Evaluation Notebook Papers.
This paper describes the experimental framework of the University of Queensland's Multimedia Search Group (UQMSG) at TRECVID 2011. We participated in two tasks this year, both for the first time. For the semantic indexing task, we submitted four lite runs: L_A_UQMSG1_1, L_A_UQMSG2_2, L_A_UQMSG3_3 and L_A_UQMSG4_4. They are all of training type A (actually we only used the IACC.1.tv10.training data), but with different parameter settings in our keyframe-based Laplacian Joint Group Lasso (LJGL) algorithm with the Local Binary Patterns (LBP) feature. For the content-based copy detection task, we submitted two runs: UQMSG.m.nofa.mfh and UQMSG.m.balanced.mfh. They used only the video modality information of keyframes and were both based on our Multiple Feature Hashing (MFH) algorithm, which fuses local (LBP) and global (HSV) visual features, with different application profiles (reducing the false alarm rate vs. balancing false alarms and misses). Due to time constraints, we were not able to adequately improve the performance of our systems on all the available training data this year. Evaluation results suggest that more effort is needed to tune system parameters well. In addition, sophisticated techniques beyond keyframe-level semantic concept propagation and near-duplicate detection are required to achieve better performance in video tasks.
Song, J, Yang, Y, Huang, Z, Shen, HT & Hong, R 2011, 'Multiple feature hashing for real-time large scale near-duplicate video retrieval', MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, ACM International Conference on Multimedia, ACM, Scottsdale, Arizona, USA, pp. 423-432.View/Download from: Publisher's site
Near-duplicate video retrieval (NDVR) has recently attracted much research attention due to the exponential growth of online videos. It helps in many areas, such as copyright protection, video tagging, online video usage monitoring, etc. Most existing approaches use only a single feature to represent a video for NDVR. However, a single feature is often insufficient to characterize the video content. Moreover, while accuracy is the main concern in the previous literature, the scalability of NDVR algorithms for large-scale video datasets has rarely been addressed. In this paper, we present a novel approach, Multiple Feature Hashing (MFH), to tackle both the accuracy and the scalability issues of NDVR. MFH preserves the local structure information of each individual feature and also globally considers the local structures of all the features to learn a group of hash functions, which map video keyframes into the Hamming space and generate a series of binary codes to represent the video dataset. We evaluate our approach on a public video dataset and a large-scale video dataset consisting of 132,647 videos, which we collected from YouTube ourselves. The experimental results show that the proposed method outperforms state-of-the-art techniques in both accuracy and efficiency. © 2011 ACM.
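A minimal sketch of the retrieval side of such a hashing scheme: project keyframe features with hash functions, binarize by sign, and rank by Hamming distance. The random projection `W` below is a placeholder for the group of hash functions that MFH actually learns from multiple features.

```python
import numpy as np

rng = np.random.default_rng(0)
d, bits = 128, 32
W = rng.standard_normal((d, bits))   # placeholder for learned hash functions

def encode(features):
    """Map real-valued keyframe features to binary codes."""
    return (features @ W > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Rank database codes by Hamming distance to the query code."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)

db = encode(rng.standard_normal((1000, d)))
q = encode(rng.standard_normal(d))
print(hamming_rank(q, db)[:5])       # indices of the 5 nearest keyframes
```

Binary codes make both storage and lookup cheap, which is what gives hashing-based NDVR its scalability.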
Wang, H, Nie, F, Huang, H & Yang, Y 2011, 'Learning frame relevance for video classification', MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, ACM International Conference on Multimedia, ACM, Scottsdale, Arizona, USA, pp. 1345-1348.View/Download from: Publisher's site
Traditional video classification methods typically require a large number of labeled training video frames to achieve satisfactory performance. However, in the real world, we usually only have sufficient labeled video clips (such as tagged online videos) but lack labeled video frames. In this paper, we formalize the video classification problem as a Multi-Instance Learning (MIL) problem, an emerging topic in machine learning in recent years, which only needs bag (video clip) labels. To solve the problem, we propose a novel Parameterized Class-to-Bag (P-C2B) Distance method to learn the relative importance of a training instance with respect to its labeled classes, such that the instance level labeling ambiguity in MIL is tackled and the frame relevances of training video data with respect to the semantic concepts of interest are given. Promising experimental results have demonstrated the effectiveness of the proposed method. Copyright 2011 ACM.
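In its simplest unweighted form, a class-to-bag distance is the distance from a class representative to the nearest instance (frame) in the bag (clip); P-C2B's contribution is to parameterize per-instance importance on top of this, which the sketch below deliberately omits.

```python
import numpy as np

def class_to_bag(prototype, bag):
    """Unweighted class-to-bag distance: the nearest frame in the bag to the
    class prototype, returned with its index (the most relevant frame)."""
    dists = np.linalg.norm(bag - prototype, axis=1)
    return float(dists.min()), int(dists.argmin())

bag = np.random.randn(20, 64)        # 20 frames with 64-dim features
prototype = np.random.randn(64)
print(class_to_bag(prototype, bag))
```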
Yang, Y, Shen, HT, Ma, Z, Huang, Z & Zhou, X 2011, 'L2,1-Norm Regularized Discriminative Feature Selection for Unsupervised Learning', 22nd IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press, Barcelona, Catalonia, pp. 1589-1594.View/Download from: Publisher's site
Compared with supervised learning for feature selection, it is much more difficult to select the discriminative features in unsupervised learning due to the lack of label information. Traditional unsupervised feature selection algorithms usually select the features which best preserve the data distribution, e.g., the manifold structure, of the whole feature set. Under the assumption that the class label of input data can be predicted by a linear classifier, we incorporate discriminative analysis and ℓ2,1-norm minimization into a joint framework for unsupervised feature selection. Different from existing unsupervised feature selection algorithms, our algorithm selects the most discriminative feature subset from the whole feature set in batch mode. Extensive experiments on different data types demonstrate the effectiveness of our algorithm.
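Once such a framework has produced a row-sparse projection matrix, the selection step itself is simple: rank features by the ℓ2 norms of the corresponding rows. A minimal sketch of that final step follows; the optimization that yields `W` is the paper's contribution and is omitted here.

```python
import numpy as np

def l21_norm(W):
    """ℓ2,1 norm: the sum of the ℓ2 norms of the rows of W."""
    return float(np.linalg.norm(W, axis=1).sum())

def select_features(W, k):
    """Keep the k features whose rows of W carry the most energy; rows driven
    to zero by the ℓ2,1 penalty correspond to discarded features."""
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(scores)[::-1][:k]
```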
Yang, Y, Shen, HT, Nie, F, Ji, R & Zhou, X 2011, 'Nonnegative spectral clustering with discriminative regularization', Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, pp. 555-560.
Clustering is a fundamental research topic in the field of data mining. Optimizing the objective functions of clustering algorithms, e.g., normalized cut and k-means, is an NP-hard optimization problem. Existing algorithms usually relax the elements of the cluster indicator matrix from discrete values to continuous ones. Eigenvalue decomposition is then performed to obtain a relaxed continuous solution, which must be discretized. The main problem is that the signs of the relaxed continuous solution are mixed. Such results may deviate severely from the true solution, making it a nontrivial task to obtain the cluster labels. To address this problem, we impose an explicit nonnegative constraint for a more accurate solution during the relaxation. In addition, we introduce a discriminative regularization into the objective to avoid overfitting. A new iterative approach is proposed to optimize the objective. We show that the algorithm is a general one which naturally leads to other extensions. Experiments demonstrate the effectiveness of our algorithm. Copyright © 2011, Association for the Advancement of Artificial Intelligence. All rights reserved.
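For context, here is a minimal sketch of the standard relaxed pipeline whose mixed-sign solution motivates the nonnegative constraint; this is the baseline being improved upon, not the paper's iterative algorithm.

```python
import numpy as np

def spectral_embedding(W, k):
    """Relaxed normalized-cut solution: the bottom-k eigenvectors of the
    normalized Laplacian. Their entries have mixed signs, which is why an
    extra discretization step (e.g. k-means) is needed to read off labels."""
    d = W.sum(axis=1)
    D_isqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L_norm = np.eye(len(d)) - D_isqrt @ W @ D_isqrt
    _, vecs = np.linalg.eigh(L_norm)     # eigenvalues in ascending order
    return vecs[:, :k]                   # real-valued, mixed-sign indicators
```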
Yang, Y, Yang, Y, Huang, Z & Shen, HT 2011, 'Transfer tagging from image to video', MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, ACM International Conference on Multimedia, ACM, Scottsdale, Arizona, USA, pp. 1137-1140.View/Download from: Publisher's site
Nowadays a massive amount of web video data is emerging on the Internet. To achieve effective and efficient video retrieval, it is critical to automatically assign semantic keywords to videos via content analysis. However, most existing video tagging methods suffer from a lack of sufficient tagged training videos due to the high labor cost of manual tagging. Inspired by the observation that there is much more well-labeled data in other, yet relevant, types of media (e.g., images), in this paper we study how to build a "cross-media tunnel" to transfer external tag knowledge from images to videos. Meanwhile, the intrinsic data structures of both the image and video spaces are explored for inferring tags. We propose a Cross-Media Tag Transfer (CMTT) paradigm which is able to: 1) transfer tag knowledge between images and videos by minimizing their distribution difference; 2) infer tags by revealing the underlying manifold structures embedded within both the image and video spaces. We also learn an explicit mapping function to handle unseen videos. Experimental results have been reported and analyzed to illustrate the superiority of our proposal. Copyright 2011 ACM.
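The distribution-difference term that such a transfer method minimizes is commonly an empirical Maximum Mean Discrepancy (MMD); a minimal linear-kernel version is sketched below as an illustration, though the paper's exact formulation may differ.

```python
import numpy as np

def linear_mmd(X_img, X_vid):
    """Linear-kernel MMD: distance between the mean feature embeddings of
    the image domain and the video domain."""
    return float(np.linalg.norm(X_img.mean(axis=0) - X_vid.mean(axis=0)))
```

Driving this quantity toward zero aligns the two domains so that tag knowledge learned on images remains valid on videos.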
Yang, Y, Yang, Y, Huang, Z, Shen, HT & Nie, F 2011, 'Tag localization with spatial correlations and joint group sparsity', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Colorado Springs, CO, USA, pp. 881-888.View/Download from: Publisher's site
Numerous social images have been emerging on the Web. How to precisely label these images is critical to image retrieval. However, traditional image-level tagging methods may become less effective because global image matching approaches can hardly cope with the diversity and arbitrariness of Web image content. This raises an urgent need for fine-grained tagging schemes. In this work, we study how to establish a mapping between tags and image regions, i.e., how to localize tags to image regions, so as to better depict and index the content of images. We propose spatial group sparse coding (SGSC), which extends the robust encoding ability of group sparse coding with spatial correlations among training regions. We represent spatial correlations in a two-dimensional image space and design group-specific spatial kernels to produce a more interpretable regularizer. Furthermore, we propose a joint version of the SGSC model which is able to simultaneously encode a group of intrinsically related regions within a test image. An effective algorithm is developed to optimize the objective function of the Joint SGSC. The tag localization task is conducted by propagating tags from sparsely selected groups of regions to the target regions according to the reconstruction coefficients. Extensive experiments on three public image datasets illustrate that our proposed models achieve great performance improvements over the state-of-the-art method in the tag localization task. © 2011 IEEE.
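The building block of any group sparse coding model is the group-wise shrinkage operator, which either zeroes out an entire group of coefficients or scales the group down as a whole. A minimal sketch is below; the spatial kernels and joint encoding of SGSC are omitted.

```python
import numpy as np

def group_shrink(v, t):
    """Proximal operator of the group-lasso penalty t * ||v||_2: kills the
    whole group when its norm is below t, otherwise shrinks it uniformly."""
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n <= t else (1.0 - t / n) * v
```

Applied group by group inside an iterative solver, this operator selects whole groups of training regions rather than isolated coefficients.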
Yang, Y, Nie, F, Xiang, S, Zhuang, Y & Wang, W 2010, 'Local and Global Regressive Mapping for Manifold Learning with Out-of-Sample Extrapolation', PROCEEDINGS OF THE TWENTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-10), 24th AAAI Conference on Artificial Intelligence (AAAI), ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE, Atlanta, GA, pp. 649-654.
Yang, Y, Xu, Nie, Luo & Zhuang 2009, 'Ranking with Local Regression and Global Alignment for Cross Media Retrieval', ACM Multimedia.
Yang, Y, Zhuang, Y, Xu, D, Pan, Y, Tao, D & Maybank, S 2009, 'Retrieval Based Interactive Cartoon Synthesis via Unsupervised Bi-Distance Metric Learning', 2009 ACM International Conference on Multimedia Compilation E-Proceedings (with co-located workshops & symposiums), ACM international conference on Multimedia, Association for Computing Machinery, Inc. (ACM), Beijing, China, pp. 311-320.View/Download from: Publisher's site
Cartoons play important roles in many areas, but producing new cartoon clips requires a great deal of labor. In this paper, we propose a gesture recognition method for cartoon character images with two applications, namely content-based cartoon image retrieval and cartoon clip synthesis. We first define Edge Features (EF) and Motion Direction Features (MDF) for cartoon character images. The features are classified into two different groups, namely intra-features and inter-features. An Unsupervised Bi-Distance Metric Learning (UBDML) algorithm is proposed to recognize the gestures of cartoon character images. Different from previous research efforts on distance metric learning, UBDML learns the optimal distance metric from the heterogeneous distance metrics derived from intra-features and inter-features. Content-based cartoon character image retrieval and cartoon clip synthesis can be carried out based on the distance metric learned by UBDML. Experiments show that cartoon character image retrieval achieves high precision and that cartoon clip synthesis can be carried out efficiently.
Yang, Y, Zhuang & Wang 2008, 'Heterogeneous Multimedia Data Semantics Mining using Content and Location Context', ACM Multimedia.
Zhuang, Y & Yang, Y 2007, 'Boosting cross-media retrieval by learning with positive and negative examples', ADVANCES IN MULTIMEDIA MODELING, PT 2, 13th International Multimedia Modeling Conference (MMM 2007), SPRINGER-VERLAG BERLIN, Singapore, SINGAPORE, pp. 165+.
Wu, F, Yang, Y, Zhuang, YT & Pan, YH 2005, 'Understanding multimedia document semantics for cross-media retrieval', ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2005, PT 1, 6th Pacific-Rim Conference on Multimedia (PCM 2005), SPRINGER-VERLAG BERLIN, Cheju Isl, SOUTH KOREA, pp. 993-1004.
Hauptmann, AG, Yang, Y & Zheng, L 2016, 'Person Re-identification: Past, Present and Future'.
Person re-identification (re-ID) has become increasingly popular in the community due to its application and research significance. It aims at spotting a person of interest in other cameras. In the early days, hand-crafted algorithms and small-scale evaluation were predominantly reported. Recent years have witnessed the emergence of large-scale datasets and deep learning systems which make use of large data volumes. Considering the different tasks, we classify most current re-ID methods into two classes, i.e., image-based and video-based; in both tasks, hand-crafted and deep learning systems are reviewed. Moreover, two new re-ID tasks which are much closer to real-world applications are described and discussed, i.e., end-to-end re-ID and fast re-ID in very large galleries. This paper: 1) introduces the history of person re-ID and its relationship with image classification and instance retrieval; 2) surveys a broad selection of hand-crafted systems and large-scale methods in both image- and video-based re-ID; 3) describes critical future directions in end-to-end re-ID and fast retrieval in large galleries; and 4) briefly discusses some important yet under-developed issues.
Chang, X, Nie, F, Ma, Z & Yang, Y, 'Balanced k-Means and Min-Cut Clustering'.
Clustering is an effective technique in data mining to generate groups that are of interest. Among the various clustering approaches, the families of k-means algorithms and min-cut algorithms are the most popular due to their simplicity and efficacy. The classical k-means algorithm partitions a number of data points into several subsets by iteratively updating the clustering centers and the associated data points. By contrast, min-cut algorithms construct a weighted undirected graph and partition its vertices into two sets. However, existing clustering algorithms tend to cluster a minority of data points into a subset, which should be avoided when the target dataset is balanced. To achieve more accurate clustering for balanced datasets, we propose to leverage the exclusive lasso on k-means and min-cut to regulate the balance degree of the clustering results. By optimizing our objective functions, which build atop the exclusive lasso, we can make the clustering result as balanced as possible. Extensive experiments on several large-scale datasets validate the advantage of the proposed algorithms compared to state-of-the-art clustering algorithms.
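Applied to a cluster indicator matrix, the exclusive lasso penalizes the sum of squared cluster sizes, which is what pushes assignments toward balance. The greedy sketch below illustrates this effect in a k-means assignment step; it conveys the penalty's behavior but is not the paper's optimization algorithm.

```python
import numpy as np

def balanced_assign(X, centers, lam=1.0):
    """Assign each point to the cluster minimizing squared distance plus the
    marginal increase (2*n_j + 1) of the balance term sum_j n_j^2."""
    sizes = np.zeros(len(centers))
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        cost = ((centers - x) ** 2).sum(axis=1) + lam * (2.0 * sizes + 1.0)
        labels[i] = int(cost.argmin())
        sizes[labels[i]] += 1
    return labels
```

Larger `lam` trades cluster compactness for balance; with `lam = 0` this reduces to the ordinary k-means assignment step.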
Chang, X, Nie, F, Yang, Y & Huang, H, 'Improved Spectral Clustering via Embedded Label Propagation'.
Spectral clustering is a key research topic in the fields of machine learning and data mining. Most existing spectral clustering algorithms are built upon Gaussian Laplacian matrices, which are sensitive to parameters. We propose a novel parameter-free, distance-consistent Locally Linear Embedding (LLE). The proposed distance-consistent LLE ensures that edges between closer data points have greater weight. Furthermore, we propose a novel improved spectral clustering via embedded label propagation. Our algorithm builds upon two advancements of the state of the art: 1) label propagation, which propagates a node's labels to neighboring nodes according to their proximity; and 2) manifold learning, which has been widely used for its capacity to leverage the manifold structure of data points. First, we perform standard spectral clustering on the original data and assign each cluster to the k nearest data points. Next, we propagate labels through dense, unlabeled data regions. Extensive experiments on various datasets validate the superiority of the proposed algorithm compared to current state-of-the-art spectral algorithms.