
Professor Yi Yang

Biography

I am a professor with the Faculty of Engineering and Information Technology, University of Technology Sydney (UTS), where I also serve as Deputy Head of School (Research) of the School of Software. I received my PhD degree in Computer Science from Zhejiang University in 2010 and was a postdoctoral researcher at the School of Computer Science, Carnegie Mellon University, before coming to Australia.

Professional

In the early years, when I was working in the DCD group directed by Prof Yunhe Pan and Prof Yueting Zhuang at Zhejiang University, I designed and developed several intelligent multimedia analysis algorithms and tools. Part of this work was recognized by the National Outstanding PhD Thesis award given by the Ministry of Education of China. From 2011, together with my colleagues in the Informedia group directed by Alex Hauptmann, I spent most of my time combining multiple signals (visual, acoustic, textual) for the automated analysis of Internet-quality videos. The systems we developed achieved top performance in various international competitions, such as TRECVID MED and TRECVID SED. Our multi-camera person tracking and identification system was selected by The Huffington Post as one of 13 incredible tech innovations.
While continuing to work on intelligent video analysis and multi-sensor signal processing, I expanded my research interests after joining UTS in January 2015. Our current research interests and projects span almost every aspect of multimodal signal processing, computer vision, text processing, and pattern recognition. I have been very fortunate to work with younger talents on various research topics in these fields. Here is a bit of information about my research team and collaborators. Here is a link to my publications. In addition to research papers, the systems our group has developed have achieved the best performance in several competitions, including TRECVID LOC, the THUMOS Action Recognition challenge, and the MSR-Bing Image Retrieval Grand Challenge. I was recently awarded a Google Faculty Research Award in recognition of my proposal on efficient video analysis.
Professor, A/DRsch Centre for Artificial Intelligence
Core Member, Centre for Artificial Intelligence
Computer Science and Technology
 
Phone
+61 2 9514 9821

Research Interests

Computer Vision, Machine Learning, Multimedia 
Can supervise: Yes

Chapters

Chen, M.Y., Hauptmann, A., Bharucha, A., Wactlar, H. & Yang, Y. 2013, 'Human activity analysis for geriatric care in nursing homes' in The Era of Interactive Media, pp. 53-61.
View/Download from: Publisher's site
© 2013 Springer Science+Business Media, LLC. All rights reserved. As our society is increasingly aging, it is urgent to develop computer-aided techniques to improve the quality-of-care (QoC) and quality-of-life (QoL) of geriatric patients. In this paper, we focus on automatic human activity analysis in video surveillance recorded in complicated environments at a nursing home. This will enable the automatic exploration of the statistical patterns between patients' daily activities and their clinical diagnoses. We also discuss potential future research directions in this area. Experiments demonstrate that the proposed approach is effective for human activity analysis.

Conferences

Yan, Y., Yang, T., Yang, Y. & Chen, J. 2017, 'A Framework of Online Learning with Imbalanced Streaming Data', AAAI.
Pan, P., Xu, Z., Yang, Y., Wu, F. & Zhuang, Y. 2016, 'Hierarchical recurrent neural encoder for video representation with application to captioning', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 1029-1038.
Recently, deep learning approaches, especially deep Convolutional Neural Networks (ConvNets), have achieved overwhelming accuracy with fast processing speed for image classification. Incorporating temporal structure with deep ConvNets for video representation becomes a fundamental problem for video content analysis. In this paper, we propose a new approach, namely Hierarchical Recurrent Neural Encoder (HRNE), to exploit temporal information of videos. Compared to recent video representation inference approaches, this paper makes the following three contributions. First, our HRNE is able to efficiently exploit video temporal structure in a longer range by reducing the length of input information flow, and compositing multiple consecutive inputs at a higher level. Second, computation operations are significantly lessened while attaining more nonlinearity. Third, HRNE is able to uncover temporal transitions between frame chunks with different granularities, i.e. it can model the temporal transitions between frames as well as the transitions between segments. We apply the new method to video captioning where temporal information plays a crucial role. Experiments demonstrate that our method outperforms the state-of-the-art on video captioning benchmarks.
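For a rough sense of the hierarchical encoding idea, here is a minimal sketch (assuming PyTorch; module names and sizes are illustrative, not the authors' implementation): a lower-level GRU summarizes short frame chunks, and a higher-level GRU reads the much shorter sequence of chunk summaries.

```python
# Minimal sketch of a hierarchical recurrent encoder in the spirit of HRNE.
# Assumptions: PyTorch, precomputed per-frame ConvNet features; illustrative only.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, chunk_len=8):
        super().__init__()
        self.chunk_len = chunk_len
        self.low = nn.GRU(feat_dim, hidden, batch_first=True)   # encodes short frame chunks
        self.high = nn.GRU(hidden, hidden, batch_first=True)    # encodes the chunk summaries

    def forward(self, frames):                # frames: (batch, n_frames, feat_dim)
        b, t, d = frames.shape
        t = t - t % self.chunk_len            # drop a ragged tail for simplicity
        chunks = frames[:, :t].reshape(b * (t // self.chunk_len), self.chunk_len, d)
        _, h = self.low(chunks)               # last hidden state summarizes each chunk
        chunk_codes = h[-1].reshape(b, t // self.chunk_len, -1)
        _, h2 = self.high(chunk_codes)        # higher level sees a shorter input flow
        return h2[-1]                         # video-level representation

video = torch.randn(2, 64, 2048)              # 2 videos, 64 frames each
print(HierarchicalEncoder()(video).shape)     # torch.Size([2, 512])
```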
Chang, X., Yang, Y., Long, G., Zhang, C. & Hauptmann, A.G. 2016, 'Dynamic concept composition for zero-example event detection', Proceedings of 30th AAAI Conference on Artificial Intelligence, AAAI 2016, Conference on Artificial Intelligence (AAAI), Association for the Advancement of Artificial Intelligence, Phoenix, Arizona, United States, pp. 3464-3470.
View/Download from: UTS OPUS
© Copyright 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. In this paper, we focus on automatically detecting events in unconstrained videos without the use of any visual training exemplars. In principle, zero-shot learning makes it possible to train an event detection model based on the assumption that events (e.g. birthday party) can be described by multiple mid-level semantic concepts (e.g. "blowing candle", "birthday cake"). Towards this goal, we first pre-train a bundle of concept classifiers using data from other sources. Then we evaluate the semantic correlation of each concept w.r.t. the event of interest and pick the relevant concept classifiers, which are applied on all test videos to get multiple prediction score vectors. While most existing systems combine the predictions of the concept classifiers with fixed weights, we propose to learn the optimal weights of the concept classifiers for each testing video by exploring a set of online available videos with freeform text descriptions of their content. To validate the effectiveness of the proposed approach, we have conducted extensive experiments on the latest TRECVID MEDTest 2014, MEDTest 2013 and CCV datasets. The experimental results confirm the superiority of the proposed approach.
Yan, Y., Xu, Z., Tsang, W., Long, G. & Yang, Y. 2016, 'Robust Semi-supervised Learning through Label Aggregation', Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), AAAI Conference on Artificial Intelligence, AAAI, Phoenix, USA, pp. 2244-2250.
View/Download from: UTS OPUS
Semi-supervised learning is proposed to exploit both labeled and unlabeled data. However, as the scale of data in real-world applications increases significantly, conventional semi-supervised algorithms usually lead to massive computational cost and cannot be applied to large-scale datasets. In addition, label noise is usually present in practical applications due to human annotation, which very likely results in remarkable degeneration of performance in semi-supervised methods. To address these two challenges, in this paper, we propose an efficient RObust Semi-Supervised Ensemble Learning (ROSSEL) method, which generates pseudo-labels for unlabeled data using a set of weak annotators, and combines them to approximate the ground-truth labels to assist semi-supervised learning. We formulate the weighted combination process as a multiple label kernel learning (MLKL) problem which can be solved efficiently. Compared with other semi-supervised learning algorithms, the proposed method has linear time complexity. Extensive experiments on five benchmark datasets demonstrate the superior effectiveness, efficiency and robustness of the proposed algorithm.
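As a much-simplified illustration of the aggregation step (a plain weighted vote with given weights, not ROSSEL's MLKL solver, which learns the weights):

```python
# Toy illustration of aggregating pseudo-labels from weak annotators.
# ROSSEL learns the combination weights via multiple label kernel learning;
# here the weights are fixed inputs, purely to show the aggregation idea.
import numpy as np

def aggregate(pseudo_labels, weights):
    # pseudo_labels: (n_annotators, n_samples) in {-1, +1}; weights: (n_annotators,)
    return np.sign(weights @ pseudo_labels)   # weighted vote per sample

votes = np.array([[ 1,  1, -1,  1],           # three weak annotators, four samples
                  [ 1, -1, -1,  1],
                  [-1,  1, -1,  1]])
print(aggregate(votes, np.array([0.5, 0.3, 0.2])))   # [ 1.  1. -1.  1.]
```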
Chang, X., Yu, Y.L., Yang, Y. & Xing, E.P. 2016, 'They are not equally reliable: Semantic event search using differentiated concept classifiers', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Computer Vision and Pattern Recognition (CVPR), IEEE, Las Vegas, Nevada, United States, pp. 1884-1893.
View/Download from: UTS OPUS or Publisher's site
Complex event detection on unconstrained Internet videos has seen much progress in recent years. However, state-of-the-art performance degrades dramatically when the number of positive training exemplars falls short. Since label acquisition is costly, laborious, and time-consuming, there is a real need to consider the much more challenging semantic event search problem, where no example video is given. In this paper, we present a state-of-the-art event search system without any example videos. Relying on the key observation that events (e.g. dog show) are usually compositions of multiple mid-level concepts (e.g. "dog," "theater," and "dog jumping"), we first train a skip-gram model to measure the relevance of each concept with the event of interest. The relevant concept classifiers then cast votes on the test videos but their reliability, due to lack of labeled training videos, has been largely unaddressed. We propose to combine the concept classifiers based on a principled estimate of their accuracy on the unlabeled test videos. A novel warping technique is proposed to improve the performance and an efficient highly-scalable algorithm is provided to quickly solve the resulting optimization. We conduct extensive experiments on the latest TRECVID MEDTest 2014, MEDTest 2013 and CCV datasets, and achieve state-of-the-art performances.
Du, X., Yin, H., Huang, Z., Yang, Y. & Zhou, X. 2016, 'Using detected visual objects to index video database', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Australasian Database Conference, Springer, Sydney, New South Wales, Australia, pp. 333-345.
View/Download from: UTS OPUS or Publisher's site
© Springer International Publishing AG 2016. In this paper, we focus on how to use visual objects to index videos. Two tables are constructed for this purpose, namely the unique object table and the occurrence table. The former table stores the unique objects which appear in the videos, while the latter table stores the occurrence information of these unique objects in the videos. In previous works, these two tables are generated manually by a top-down process. That is, the unique object table is given by the experts at first, then the occurrence table is generated by the annotators according to the unique object table. Obviously, such a process, which heavily depends on human labor, limits scalability, especially when the data are dynamic or large-scale. To improve this, we propose to perform a bottom-up process to generate these two tables. The novelties are: we use an object detector instead of human annotation to create the occurrence table; we propose a hybrid method which consists of local merge, global merge and propagation to generate the unique object table and fix the occurrence table. In fact, there are three other candidate methods for implementing the bottom-up process, namely, recognizing-based, matching-based and tracking-based methods. Through analyzing their mechanisms and evaluating their accuracy, we find that they are not suitable for the bottom-up process. The proposed hybrid method leverages the advantages of the matching-based and tracking-based methods. Our experiments show that the hybrid method is more accurate and efficient than the candidate methods, which indicates that it is more suitable for the proposed bottom-up process.
Luo, M., Nie, F., Chang, X., Yang, Y., Hauptmann, A. & Zheng, Q. 2016, 'Avoiding optimal mean robust PCA/2DPCA with non-greedy ℓ1-norm maximization', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence (IJCAI), AAAI Press / International Joint Conferences on Artificial Intelligence, New York City, New York, United States, pp. 1802-1808.
View/Download from: UTS OPUS
Robust principal component analysis (PCA) is one of the most important dimension reduction techniques to handle high-dimensional data with outliers. However, the existing robust PCA presupposes that the mean of the data is zero and incorrectly utilizes the Euclidean-distance-based optimal mean for robust PCA with the ℓ1-norm. Some studies consider this issue and integrate the estimation of the optimal mean into the dimension reduction objective, which leads to expensive computation. In this paper, we equivalently reformulate the maximization of variances for robust PCA, such that the optimal projection directions are learned by maximizing the sum of the projected difference between each pair of instances, rather than the difference between each instance and the mean of the data. Based on this reformulation, we propose a novel robust PCA to automatically avoid the calculation of the optimal mean based on the ℓ1-norm distance. This strategy also makes the assumption of centered data unnecessary. Additionally, we intuitively extend the proposed robust PCA to its 2D version for image recognition. Efficient non-greedy algorithms are exploited to solve the proposed robust PCA and 2D robust PCA with fast convergence and low computational complexity. Some experimental results on benchmark data sets demonstrate the effectiveness and superiority of the proposed approaches on image reconstruction and recognition.
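A minimal sketch of the pairwise-difference idea: find a direction w maximizing the sum of |w^T(x_i - x_j)| over all pairs, so no centering or optimal-mean step is needed. The fixed-point update below is the classic greedy ℓ1-PCA iteration applied to pairwise differences, shown only for intuition; the paper's non-greedy solver is more sophisticated.

```python
# Illustrative greedy solver for max_w sum_{i<j} |w^T (x_i - x_j)|, ||w|| = 1.
# Not the paper's non-greedy algorithm; a simple sign fixed-point iteration.
import numpy as np

def l1_pca_pairwise(X, n_iter=100, seed=0):
    n, d = X.shape
    i, j = np.triu_indices(n, k=1)
    D = X[i] - X[j]                       # all pairwise differences, no mean needed
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        s = np.sign(D @ w)                # sign pattern of projected differences
        s[s == 0] = 1.0
        w_new = D.T @ s                   # best w for the current fixed signs
        w_new /= np.linalg.norm(w_new)
        if np.allclose(w_new, w):
            break
        w = w_new
    return w

X = np.vstack([np.random.randn(50, 5), 10 + np.random.randn(5, 5)])  # data with outliers
print(l1_pca_pairwise(X))                 # a robust projection direction
```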
Yang, Y., Gan, C., Lin, M., de Melo, G. & Hauptmann, A.G. 2016, 'Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition', Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), AAAI Conference on Artificial Intelligence, AAAI, Phoenix, Arizona, pp. 3487-3493.
View/Download from: UTS OPUS
Vast quantities of videos are now being captured at astonishing rates, but the majority of these are not labelled. To cope with such data, we consider the task of content-based activity recognition in videos without any manually labelled examples, also known as zero-shot video recognition. To achieve this, videos are represented in terms of detected visual concepts, which are then scored as relevant or irrelevant according to their similarity with a given textual query. In this paper, we propose a more robust approach for scoring concepts in order to alleviate many of the brittleness and low precision problems of previous work. We jointly consider semantic relatedness, visual reliability, and discriminative power. To handle noise and non-linearities in the ranking scores of the selected concepts, we propose a novel pairwise order matrix approach for score aggregation. Extensive experiments on the large-scale TRECVID Multimedia Event Detection data show the superiority of our approach.
Yang, Y., Gan, C., Yao, T., Yang, K. & Mei, T. 2016, 'You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images', Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, Computer Vision and Pattern Recognition (CVPR), IEEE, Las Vegas, pp. 923-932.
View/Download from: UTS OPUS
Chang, X.J., Nie, F.P., Ma, Z.G., Yang, Y. & Zhou, X.F. 2015, 'A Convex Formulation for Spectral Shrunk Clustering', Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, Austin Texas, USA, pp. 2532-2538.
Spectral clustering is a fundamental technique in the field of data mining and information processing. Most existing spectral clustering algorithms integrate dimensionality reduction into the clustering process assisted by manifold learning in the original space. However, the manifold in the reduced-dimensional subspace is likely to exhibit altered properties in contrast with the original space. Thus, applying manifold information obtained from the original space to the clustering process in a low-dimensional subspace is prone to inferior performance. Aiming to address this issue, we propose a novel convex algorithm that mines the manifold structure in the low-dimensional subspace. In addition, our unified learning process makes the manifold learning particularly tailored for the clustering. Compared with other related methods, the proposed algorithm results in a more structured clustering result. To validate the efficacy of the proposed algorithm, we perform extensive experiments on several benchmark datasets in comparison with some state-of-the-art clustering approaches. The experimental results demonstrate that the proposed algorithm has quite promising clustering performance.
Yan, Y., Yang, Y., Shen, H., Meng, D., Liu, G.W., Hauptmann, A.G. & Sebe, N. 2015, 'Complex Event Detection via Event Oriented Dictionary Learning', Proceedings of the 29th AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, Austin, USA, pp. 3841-3847.
Complex event detection is a retrieval task with the goal of finding videos of a particular event in a large-scale unconstrained internet video archive, given example videos and text descriptions. Nowadays, different multimodal fusion schemes of low-level and high-level features are extensively investigated and evaluated for the complex event detection task. However, how to effectively select high-level, semantically meaningful concepts from a large pool to assist complex event detection is rarely studied in the literature. In this paper, we propose two novel strategies to automatically select semantically meaningful concepts for the event detection task based on both the event-kit text descriptions and the concepts' high-level feature descriptions. Moreover, we introduce a novel event-oriented dictionary representation based on the selected semantic concepts. Towards this goal, we leverage training samples of selected concepts from the Semantic Indexing (SIN) dataset, with a pool of 346 concepts, in a novel supervised multi-task dictionary learning framework. Extensive experimental results on the TRECVID Multimedia Event Detection (MED) dataset demonstrate the efficacy of our proposed method.
Gan, C., Lin, M., Yang, Y., Zhuang, Y.T. & Hauptmann, A.G. 2015, 'Exploring Semantic Inter-Class Relationships (SIR) for Zero-Shot Action Recognition', Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, Austin, USA, pp. 3769-3775.
Automatically recognizing a large number of action categories from videos is of significant importance for video understanding. Most existing works have focused on the design of more discriminative feature representations, and have achieved promising results when positive samples are plentiful. However, very limited effort has been spent on recognizing a novel action without any positive exemplars, which is often the case in real settings due to the large number of action classes and the dramatic variations in users' queries. To address this issue, we propose to perform action recognition when no positive exemplars of that class are provided, which is often known as zero-shot learning. Different from other zero-shot learning approaches, which exploit attributes as the intermediate layer for the knowledge transfer, our main contribution is SIR, which directly leverages the semantic inter-class relationships between the known and unknown actions, followed by label transfer learning. The inter-class semantic relationships are automatically measured by continuous word vectors, which are learned by the skip-gram model using a large-scale text corpus. Extensive experiments on the UCF101 dataset validate the superiority of our method over fully-supervised approaches using few positive exemplars.
Wu, Song, Yang, Y., Li, Zhang & Zhuang 2015, 'Structured Embedding via Pairwise Relations and Long-Range Interactions in Knowledge Base', AAAI Conference on Artificial Intelligence, 2015, AAAI, USA.
We consider the problem of embedding entities and relations of knowledge bases into low-dimensional continuous vector spaces (distributed representations). Unlike most existing approaches, which are primarily efficient for modelling pairwise relations between entities, we attempt to explicitly model both pairwise relations and long-range interactions between entities, by interpreting them as linear operators on the low-dimensional embeddings of the entities. Therefore, in this paper we introduce Path-Ranking to capture the long-range interactions of the knowledge graph while at the same time preserving its pairwise relations; we call this 'structured embedding via pairwise relations and long-range interactions' (referred to as SePLi). Compared with state-of-the-art models, SePLi achieves better embedding performance.
Xu, Z.W., Yang, Y. & Hauptmann, A.G. 2015, 'A Discriminative CNN Video Representation for Event Detection', 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA.
View/Download from: Publisher's site
In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkits. This paper makes two contributions to the inference of CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating frame level static features, we show that performance can be significantly improved by taking advantage of an appropriate encoding method. Second, we propose using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable. The integration of the two contributions results in a new state-of-the-art performance in event detection over the largest video datasets. Compared to improved Dense Trajectories, which has been recognized as the best video representation for event detection, our new representation improves the Mean Average Precision (mAP) from 27.6% to 36.8% for the TRECVID MEDTest 14 dataset and from 34.0% to 44.6% for the TRECVID MEDTest 13 dataset.
Gan, C., Wang, N., Yang, Y., Yeung, D.Y. & Hauptmann, A.G. 2015, 'Devnet: A Deep Event Network for Multimedia Event Detection and Evidence Recounting', Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Boston, USA, pp. 2568-2577.
In this paper, we focus on complex event detection in internet videos while also providing the key evidence for the detection results. Convolutional Neural Networks (CNNs) have achieved promising performance in image classification and action recognition tasks. However, it remains an open problem how to use CNNs for video event detection and recounting, mainly due to the complexity and diversity of video events. In this work, we propose a flexible deep CNN infrastructure, namely Deep Event Network (DevNet), that simultaneously detects pre-defined events and provides key spatial-temporal evidence. Taking key frames of videos as input, we first detect the event of interest at the video level by aggregating the CNN features of the key frames. The pieces of evidence that recount the detection results are also automatically localized, both temporally and spatially. The challenge is that we only have video-level labels, while the key evidence usually takes place at the frame level. Based on the intrinsic property of CNNs, we first generate a spatial-temporal saliency map by back passing through DevNet, which can then be used to find the key frames that are most indicative of the event, as well as to localize the specific spatial position, usually an object, in the frame of the highly indicative area. Experiments on the large-scale TRECVID 2014 MEDTest dataset demonstrate the promising performance of our method, both for event detection and evidence recounting.
Yan, Y., Tan, M., Yang, Y., Tsang, I. & Zhang, C. 2015, 'Scalable maximum margin matrix factorization by active riemannian subspace search', Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), IJCAI International Joint Conference on Artificial Intelligence, AAAI, Buenos Aires, Argentina, pp. 3988-3994.
View/Download from: UTS OPUS
The user ratings in recommendation systems are usually in the form of ordinal discrete values. To give more accurate prediction of such rating data, maximum margin matrix factorization (M3F) was proposed. Existing M3F algorithms, however, either have massive computational cost or require expensive model selection procedures to determine the number of latent factors (i.e. the rank of the matrix to be recovered), making them less practical for large-scale data sets. To address these two challenges, in this paper, we formulate M3F with a known number of latent factors as the Riemannian optimization problem on a fixed-rank matrix manifold and present a block-wise nonlinear Riemannian conjugate gradient method to solve it efficiently. We then apply a simple and efficient active subspace search scheme to automatically detect the number of latent factors. Empirical studies on both synthetic data sets and large real-world data sets demonstrate the superior efficiency and effectiveness of the proposed method.
Yang, Y., Yu, S.I., Jiang, L., Xu, Z.W. & Hauptmann, A.G. 2015, 'Content-Based Video Search over 1 Million Videos with 1 Core in 1 Second', Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, 5th ACM on International Conference on Multimedia Retrieval, ACM, Shanghai, China, pp. 419-426.
View/Download from: Publisher's site
Many content-based video search (CBVS) systems have been proposed to analyze the rapidly-increasing amount of user-generated videos on the Internet. Though the accuracy of CBVS systems has drastically improved, these high-accuracy systems tend to be too inefficient for interactive search. Therefore, to strive for real-time web-scale CBVS, we perform a comprehensive study on the different components in a CBVS system to understand the trade-offs between accuracy and speed of each component. Directions investigated include exploring different low-level and semantics-based features, testing different compression factors and approximations during video search, and understanding the time vs. accuracy trade-off of reranking. Extensive experiments on data sets consisting of more than 1,000 hours of video showed that through a combination of effective features, highly compressed representations, and one iteration of reranking, our proposed system can achieve a 10,000-fold speedup while retaining 80% accuracy of a state-of-the-art CBVS system. We further performed search over 1 million videos and demonstrated that our system can complete the search in 0.975 seconds with a single core, which potentially opens the door to interactive web-scale CBVS for the general public.
Chang, X., Yang, Y., Hauptmann, A., Xing, E.P. & Yu, Y.L. 2015, 'Semantic Concept Discovery for Large-Scale Zero-Shot Event Detection', Proceedings of the 24th International Conference on Artificial Intelligence, International Joint Conference of Artificial Intelligence, ACM, Buenos Aires, Argentina, pp. 2234-2240.
We focus on detecting complex events in unconstrained Internet videos. While most existing works rely on the abundance of labeled training data, we consider a more difficult zero-shot setting where no training data is supplied. We first pre-train a number of concept classifiers using data from other sources. Then we evaluate the semantic correlation of each concept w.r.t. the event of interest. After further refinement to take prediction inaccuracy and discriminative power into account, we apply the discovered concept classifiers on all test videos and obtain multiple score vectors. These distinct score vectors are converted into pairwise comparison matrices and the nuclear norm rank aggregation framework is adopted to seek consensus. To address the challenging optimization formulation, we propose an efficient, highly scalable algorithm that is an order of magnitude faster than existing alternatives. Experiments on recent TRECVID datasets verify the superiority of the proposed approach.
Chang, X., Yang, Y., Xing, E.P. & Yu, Y.-.L. 2015, 'Complex Event Detection using Semantic Saliency and Nearly-Isotonic SVM', Proceedings of The 32nd International Conference on Machine Learning, International Conference on Machine Learning, International Machine Learning Society, Lille, France, pp. 1348-1357.
We aim to detect complex events in long Internet videos that may last for hours. A major challenge in this setting is that only a few shots in a long video are relevant to the event of interest while others are irrelevant or even misleading. Instead of indifferently pooling the shots, we first define a novel notion of semantic saliency that assesses the relevance of each shot with the event of interest. We then prioritize the shots according to their saliency scores since shots that are semantically more salient are expected to contribute more to the final event detector. Next, we propose a new isotonic regularizer that is able to exploit the semantic ordering information. The resulting nearly-isotonic SVM classifier exhibits higher discriminative power. Computationally, we develop an efficient implementation using the proximal gradient algorithm, and we prove new, closed-form proximal steps. We conduct extensive experiments on three real-world video datasets and confirm the effectiveness of the proposed approach.
Chang, X., Yu, Y.L., Yang, Y. & Hauptmann, A.G. 2015, 'Searching persuasively: Joint event detection and evidence recounting with limited supervision', Proceedings of the 23rd ACM international conference on Multimedia, ACM international conference on Multimedia, ACM, Brisbane, Australia, pp. 581-590.
View/Download from: Publisher's site
© 2015 ACM. Multimedia event detection (MED) and multimedia event recounting (MER) are fundamental tasks in managing large amounts of unconstrained web videos, and have attracted a lot of attention in recent years. Most existing systems perform MER as a postprocessing step on top of the MED results. In order to leverage the mutual benefits of the two tasks, we propose a joint framework that simultaneously detects high-level events and localizes the indicative concepts of the events. Our premise is that a good recounting algorithm should not only explain the detection result, but should also be able to assist detection in the first place. Coupled in a joint optimization framework, recounting improves detection by pruning irrelevant noisy concepts while detection directs recounting to the most discriminative evidences. To better utilize the powerful and interpretable semantic video representation, we segment each video into several shots and exploit the rich temporal structures at shot level. The consequent computational challenge is carefully addressed through a significant improvement of the current ADMM algorithm, which, after eliminating all inner loops and equipping novel closed-form solutions for all intermediate steps, enables us to efficiently process extremely large video corpora. We test the proposed method on the large scale TRECVID MEDTest 2014 and MEDTest 2013 datasets, and obtain very promising results for both MED and MER.
Jiang, L., Yu, S.I., Meng, D., Yang, Y., Mitamura, T. & Hauptmann, A.G. 2015, 'Fast and accurate content-based semantic search in 100M Internet Videos', MM 2015 - Proceedings of the 2015 ACM Multimedia Conference, 2015 ACM Multimedia Conference, Brisbane, Australia, pp. 49-58.
View/Download from: Publisher's site
© 2015 ACM. Large-scale content-based semantic search in video is an interesting and fundamental problem in multimedia analysis and retrieval. Existing methods index a video by the raw concept detection score that is dense and inconsistent, and thus cannot scale to "big data" that are readily available on the Internet. This paper proposes a scalable solution. The key is a novel step called concept adjustment that represents a video by a few salient and consistent concepts that can be efficiently indexed by the modified inverted index. The proposed adjustment model relies on a concise optimization framework with interpretations. The proposed index leverages the text-based inverted index for video retrieval. Experimental results validate the efficacy and the efficiency of the proposed method. The results show that our method can scale up the semantic search while maintaining state-of-the-art search performance. Specifically, the proposed method (with reranking) achieves the best result on the challenging TRECVID Multimedia Event Detection (MED) zero-example task. It only takes 0.2 seconds on a single CPU core to search a collection of 100 million Internet videos.
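The indexing idea, that a video represented by a few salient concepts can be served from a text-style inverted index, can be sketched as follows. The hard threshold stands in for the paper's concept-adjustment step, and all names and scores are made up for illustration.

```python
# Toy inverted index over sparse, salient video concepts (illustrative only).
from collections import defaultdict

def build_index(video_scores, threshold=0.7):
    index = defaultdict(list)                    # concept -> [(video_id, score)]
    for vid, scores in video_scores.items():
        for concept, s in scores.items():
            if s >= threshold:                   # keep only salient concepts
                index[concept].append((vid, s))
    return index

def search(index, query_concepts):
    hits = defaultdict(float)
    for c in query_concepts:                     # one posting-list lookup per concept
        for vid, s in index.get(c, []):
            hits[vid] += s
    return sorted(hits.items(), key=lambda kv: -kv[1])

scores = {"v1": {"dog": 0.9, "park": 0.8}, "v2": {"dog": 0.75}, "v3": {"cat": 0.95}}
idx = build_index(scores)
print(search(idx, ["dog", "park"]))              # v1 ranks above v2; v3 is never touched
```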
Zhang, L., Yang, Y. & Zimmermann, R. 2015, 'Fine-grained image categorization by localizing tiny object parts from unannotated images', ICMR '15 Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, Annual ACM International Conference on Multimedia Retrieval (ICMR), ACM, Shanghai, China, pp. 107-114.
View/Download from: Publisher's site
This paper proposes a novel fine-grained image categorization model where no object annotation is required in the training/testing stage. The key technique is a dense graph mining algorithm that localizes multi-scale discriminative object parts in each image. In particular, to mimic the human hierarchical perception mechanism, a super-pixel pyramid is generated for each image, based on which graphlets from each layer are constructed to seamlessly describe object parts. We observe that graphlets representative of each category are densely distributed in the feature space. Therefore a dense graph mining algorithm is developed to discover graphlets representative of each sub-/super-category. Finally, the discovered graphlets from pairwise images are encoded into an image kernel for fine-grained recognition. Experiments on the CUB-200 dataset [32] show that our method performs competitively with many models relying on annotated bird parts.
Nie, L.Q., Zhang, L.M., Yang, Y., Wang, M., Hong, R. & Chua, T.S. 2015, 'Beyond Doctors: Future Health Prediction from Multimedia', Proceedings of the 2015 ACM Multimedia Conference, ACM Multimedia Conference, ACM, Brisbane, Australia, pp. 591-600.
View/Download from: Publisher's site
Although chronic diseases cannot be cured, they can be effectively controlled as long as we understand their progressions based on the current observational health records, which are often in the form of multimedia data. A large and growing body of literature has investigated the disease progression problem. However, far too little attention to date has been paid to jointly consider the following three observations of the chronic disease progression: 1) the health statuses at different time points are chronologically similar; 2) the future health statuses of each patient can be comprehensively revealed from the current multimedia and multimodal observations, such as visual scans, digital measurements and textual medical histories; and 3) the discriminative capabilities of different modalities vary significantly in accordance to specific diseases. In the light of these, we propose an adaptive multimodal multi-task learning model to co-regularize the modality agreement, temporal progression and discriminative capabilities of different modalities. We theoretically show that our proposed model is a linear system. Before training our model, we address the data missing problem via the matrix factorization approach. Extensive evaluations on a real-world Alzheimer's disease dataset well verify our proposed model. It should be noted that our model is also applicable to other chronic diseases.
Liu, G., Yan, Y., Ricci, E., Yang, Y., Han, Y., Winkler, S. & Sebe, N. 2015, 'Inferring Painting Style with Multi-task Dictionary Learning', IJCAI'15 Proceedings of the 24th International Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press, Buenos Aires, pp. 2162-2168.
Xu, Z., Tsang, W., Yang, Y., Ma, Z. & Hauptmann, A.G. 2014, 'Event Detection using Multi-Level Relevance Labels and Multiple Features', 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, pp. 97-104.
View/Download from: Publisher's site
We address the challenging problem of utilizing related exemplars for complex event detection while multiple features are available. Related exemplars share certain positive elements of the event, but have no uniform pattern due to the huge variance of relevance levels among different related exemplars. None of the existing multiple feature fusion methods can deal with the related exemplars. In this paper, we propose an algorithm which adaptively utilizes the related exemplars by cross-feature learning. Ordinal labels are used to represent the multiple relevance levels of the related videos. Label candidates of related exemplars are generated by exploring the possible relevance levels of each related exemplar via a cross-feature voting strategy. Maximum margin criterion is then applied in our framework to discriminate the positive and negative exemplars, as well as the related exemplars from different relevance levels. We test our algorithm using the large scale TRECVID 2011 dataset and it gains promising performance.
Yu, Z., Wu, F., Yang, Y., Tian, Q., Luo, J. & Zhuang, Y. 2014, 'Discriminative coupled dictionary hashing for fast cross-media retrieval', SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 395-404.
View/Download from: Publisher's site
Cross-media hashing, which conducts cross-media retrieval by embedding data from different modalities into a common low-dimensional Hamming space, has attracted intensive attention in recent years. The existing cross-media hashing approaches only aim at learning hash functions to preserve the intra-modality and inter-modality correlations, but do not directly capture the underlying semantic information of the multi-modal data. We propose a discriminative coupled dictionary hashing (DCDH) method in this paper. In DCDH, the coupled dictionary for each modality is learned with side information (e.g., categories). As a result, the coupled dictionaries not only preserve the intra-similarity and inter-correlation among multi-modal data, but also contain dictionary atoms that are semantically discriminative (i.e., the data from the same category is reconstructed by the similar dictionary atoms). To perform fast cross-media retrieval, we learn hash functions which map data from the dictionary space to a low-dimensional Hamming space. Besides, we conjecture that a balanced representation is crucial in cross-media retrieval. We introduce multi-view features on the relatively "weak" modalities into DCDH and extend it to multiview DCDH (MV-DCDH) in order to enhance their representation capability. The experiments on two real-world data sets show that our DCDH and MV-DCDH outperform the state-of-the-art methods significantly on cross-media retrieval. Copyright 2014 ACM.
Zhang, L., Yang, Y. & Zimmermann, R. 2014, 'Discriminative cellets discovery for fine-grained image categories retrieval', ICMR 2014 - Proceedings of the ACM International Conference on Multimedia Retrieval 2014, pp. 57-64.
View/Download from: Publisher's site
Fine-grained image category recognition is a challenging task aiming at distinguishing objects belonging to the same basic-level category, such as leaf or mushroom. It is a useful technique that can be applied for species recognition, face verification, etc. Most of the existing methods have difficulties automatically detecting discriminative object components. In this paper, we propose a new fine-grained image categorization model that can be deemed an improved version of spatial pyramid matching (SPM). Instead of the conventional SPM that enumeratively conducts cell-to-cell matching between images, the proposed model combines multiple cells into cellets that are highly responsive to object fine-grained categories. In particular, we describe object components by cellets that connect spatially adjacent cells from the same pyramid level. Straightforwardly, image categorization can be cast as the matching between cellets extracted from pairwise images. Toward an effective matching process, a hierarchical sparse coding algorithm is derived that represents each cellet by a linear combination of the basis cellets. Further, a linear discriminant analysis (LDA)-like scheme is employed to select the cellets with high discrimination. On the basis of the feature vector built from the selected cellets, fine-grained image categorization is conducted by training a linear SVM. Experimental results on the Caltech-UCSD birds, the Leeds butterflies, and the COSMIC insects data sets demonstrate that our model outperforms the state-of-the-art. Besides, the visualized cellets show discriminative object parts are localized accurately. Copyright 2014 ACM.
Gao, C., Yang, Y., Liu, G., Meng, D., Cai, Y., Xu, S., Tong, W., Shen, H. & Hauptmann, A.G. 2014, 'Interactive surveillance event detection through mid-level discriminative representation', ICMR 2014 - Proceedings of the ACM International Conference on Multimedia Retrieval 2014, pp. 305-312.
View/Download from: Publisher's site
Event detection from real surveillance videos with complicated background environment is always a very hard task. Different from the traditional retrospective and interactive systems designed on this task, which are mainly executed on video fragments located within the event-occurrence time, in this paper we propose a new interactive system constructed on the mid-level discriminative representations (patches/shots) which are closely related to the event (might occur beyond the event-occurrence period) and are easier to be detected than video fragments. By virtue of such easily-distinguished mid-level patterns, our framework realizes an effective labor division between computers and human participants. The task of computers is to train classifiers on a bunch of mid-level discriminative representations, and to sort all the possible mid-level representations in the evaluation sets based on the classifier scores. The task of human participants is then to readily search the events based on the clues offered by these sorted mid-level representations. For computers, such mid-level representations, with more concise and consistent patterns, can be more accurately detected than video fragments utilized in the conventional framework, and on the other hand, a human participant can always much more easily search the events of interest implicated by these location-anchored mid-level representations than conventional video fragments containing entire scenes. Both of these two properties facilitate the availability of our framework in real surveillance event detection applications. Copyright is held by the owner/author(s).
Ma, Yang, Y., Sebe & Hauptmann 2014, 'Multiple Features But Few Labels? A Symbiotic Solution Exemplified for Video Analysis', ACM Multimedia.
Xu, Ye, Li, Liu, Yang, Y. & Ding 2014, 'Dynamic Background Learning through Deep Auto-encoder Networks', ACM Multimedia.
Yang, Y., Shen, Yu, Meng & Hauptmann 2014, 'Unsupervised Video Adaptation for Parsing Human Motion', European Conference on Computer Vision.
Chang, Nie, Yang, Y. & Huang 2014, 'A Convex Formulation for Semi-supervised Multi-Label Feature Selection', AAAI Conference on Artificial Intelligence.
Peng, Meng, Xu, Gao, Yang, Y. & Zhang 2014, 'Decomposable Nonlocal Tensor Dictionary Learning for Multispectral Image Denoising', Computer Vision and Pattern Recognition, Columbus, Ohio, USA.
Li, Li, Wang, Yang, Y., Zhang & Zhou 2014, 'Overcoming Semantic Drift in Information Extraction', In Proceedings of the 17th International Conference on Extending Database Technology.
Lan, Z.Z., Yang, Y., Ballas, N., Yu, S.I. & Hauptmann, A. 2014, 'Resource constrained multimedia event detection', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 388-399.
View/Download from: Publisher's site
We present a study comparing the cost and efficiency tradeoffs of multiple features for multimedia event detection. Low-level as well as semantic features are a critical part of contemporary multimedia and computer vision research. Arguably, combinations of multiple feature sets have been a major reason for recent progress in the field, not just as low-dimensional representations of multimedia data, but also as a means to semantically summarize images and videos. However, their efficacy for complex event recognition in unconstrained videos on standardized datasets has not been systematically studied. In this paper, we evaluate the accuracy and contribution of more than 10 multi-modality features, including semantic and low-level video representations, using two newly released NIST TRECVID Multimedia Event Detection (MED) open source datasets, i.e. MEDTEST and KINDREDTEST, which contain more than 1000 hours of videos. Contrasting multiple performance metrics, such as average precision, probability of missed detection and minimum normalized detection cost, we propose a framework to balance the trade-off between accuracy and computational cost. This study provides an empirical foundation for selecting feature sets that are capable of dealing with large-scale data with limited computational resources and are likely to produce superior multimedia event detection accuracy. This framework also applies to other resource limited multimedia analyses such as selecting/fusing multiple classifiers and different representations of each feature set. © 2014 Springer International Publishing.
Xu, Z., Yang, Y., Kassim, A. & Yan, S. 2014, 'Cross-media relevance mining for evaluating text-based image search engine', 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW).
Xu, Z., Yang, Y., Tsang, I., Hauptmann, A. & Sebe, N. 2013, 'Feature Weighting via Optimal Thresholding for Video Analysis', 2013 IEEE International Conference on Computer Vision, International Conference on Computer Vision, IEEE, Sydney, pp. 3440-3447.
View/Download from: UTS OPUS or Publisher's site
Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks like the TRECVID Multimedia Event Detection (MED) competition [1]. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT), to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of the fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on the HMDB [8] action recognition dataset and the CCV [5] video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.
Wu, F., Tan, X., Yang, Y., Tao, D., Tang, S. & Zhuang, Y. 2013, 'Supervised Nonnegative Tensor Factorization with Maximum-Margin Constraint', Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Bellevue, Washington, USA, pp. 962-968.
View/Download from: UTS OPUS
Non-negative tensor factorization (NTF) has attracted great attention in the machine learning community. In this paper, we extend traditional non-negative tensor factorization into a supervised discriminative decomposition, referred to as Supervised Non-negative Tensor Factorization with Maximum-Margin Constraint (SNTFM2). SNTFM2 formulates the optimal discriminative factorization of non-negative tensorial data as a coupled least-squares optimization problem via a maximum-margin method. As a result, SNTFM2 not only faithfully approximates the tensorial data by additive combinations of the basis, but also obtains a strong generalization power for discriminative analysis (in particular, for classification in this paper). The experimental results show the superiority of our proposed model over state-of-the-art techniques on both toy and real-world data sets.
Yang, Y., Ma, Z., Xu, Z., Yan, S. & Hauptmann, A.G. 2013, 'How Related Exemplars Help Complex Event Detection in Web Videos?', 2013 IEEE International Conference on Computer Vision (ICCV), pp. 2104-2111.
View/Download from: Publisher's site
Ballas, N., Yang, Y., Lan, Z.Z., Delezoide, B., Preteux, F. & Hauptmann, A. 2013, 'Space-time robust representation for action recognition', Proceedings of the IEEE International Conference on Computer Vision, pp. 2704-2711.
View/Download from: Publisher's site
We address the problem of action recognition in unconstrained videos. We propose a novel content-driven pooling that leverages space-time context while being robust toward global space-time transformations. Being robust to such transformations is of primary importance in unconstrained videos where the action localizations can drastically shift between frames. Our pooling identifies regions of interest using video structural cues estimated by different saliency functions. To combine the different structural information, we introduce an iterative structure learning algorithm, WSVM (weighted SVM), that determines the optimal saliency layout of an action model through a sparse regularizer. A new optimization method is proposed to solve WSVM's highly non-smooth objective function. We evaluate our approach on standard action datasets (KTH, UCF50 and HMDB). Most noticeably, the accuracy of our algorithm reaches 51.8% on the challenging HMDB dataset, outperforming the previous state-of-the-art by a relative 7.3%. © 2013 IEEE.
Cao, X., Wei, X., Han, Y., Yang, Y. & Lin, D. 2013, 'Robust tensor clustering with non-greedy maximization', IJCAI International Joint Conference on Artificial Intelligence, pp. 1254-1259.
Tensors are increasingly common in several areas such as data mining, computer graphics, and computer vision. Tensor clustering is a fundamental tool for data analysis and pattern discovery. However, there usually exist outlying data points in real-world datasets, which will reduce the performance of clustering. This motivates us to develop a tensor clustering algorithm that is robust to the outliers. In this paper, we propose an algorithm of Robust Tensor Clustering (RTC). The RTC firstly finds a lower-rank approximation of the original tensor data using an L1-norm optimization function. Because the L1 norm doesn't exaggerate the effect of outliers compared with the L2 norm, the minimization of the L1-norm approximation function makes RTC robust to outliers. Then we compute the HOSVD decomposition of this approximate tensor to obtain the final clustering results. Different from the traditional algorithm solving the approximation function with a greedy strategy, we utilize a non-greedy strategy to obtain a better solution. Experiments demonstrate that RTC has better performance than the state-of-the-art algorithms and is more robust to outliers.
Song, J., Yang, Y., Yang, Y., Huang, Z. & Shen, H.T. 2013, 'Inter-media hashing for large-scale retrieval from heterogeneous data sources', Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 785-796.
View/Download from: Publisher's site
In this paper, we present a new multimedia retrieval paradigm to innovate large-scale search of heterogeneous multimedia data. It is able to return results of different media types from heterogeneous data sources, e.g., using a query image to retrieve relevant text documents or images from different data sources. This utilizes the widely available data from different sources and caters for the current users' demand of receiving a result list simultaneously containing multiple types of data to obtain a comprehensive understanding of the query's results. To enable large-scale inter-media retrieval, we propose a novel inter-media hashing (IMH) model to explore the correlations among multiple media types from different data sources and tackle the scalability issue. To this end, multimedia data from heterogeneous data sources are transformed into a common Hamming space, in which fast search can be easily implemented by XOR and bit-count operations. Furthermore, we integrate a linear regression model to learn hashing functions so that the hash codes for new data points can be efficiently generated. Experiments conducted on real-world large-scale multimedia datasets demonstrate the superiority of our proposed method compared with state-of-the-art techniques. Copyright © 2013 ACM.
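The XOR-and-bit-count search mentioned in the abstract is easy to make concrete; here is a toy sketch with made-up 8-bit codes (real systems pack much longer codes into machine words):

```python
# Hamming-space retrieval in miniature: distance = XOR, then count set bits.
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")          # differing bits between two hash codes

db = {"doc1": 0b10110010, "img7": 0b10110110, "doc9": 0b01001101}
query = 0b10110011                        # hash code of the query item
print(sorted(db, key=lambda k: hamming(db[k], query)))  # ['doc1', 'img7', 'doc9']
```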
Ma, Yang, Y., Xu, Sebe & Hauptmann 2013, 'We Are Not Equally Negative: Fine-grained Labeling for Multimedia Event Detection', ACM Multimedia.
Yu, Yang, Y. & Hauptmann 2013, 'Harry Potter's Marauder's Map: Localizing and Tracking Multiple Persons-of-Interest by Nonnegative Discretization', IEEE Conference on Computer Vision and Pattern Recognition, Portland, USA.
Ma, Yang, Y., Xu, Sebe, Yan & Hauptmann 2013, 'Complex Event Detection via Multi-Source Video Attributes', IEEE Conference on Computer Vision and Pattern Recognition.
Han, Yang, Y. & Zhou 2013, 'Co-Regularized Ensemble for Feature Selection', International Joint Conference on Artificial Intelligence.
Ma, Yang, Y., Nie & Sebe 2013, 'Thinking of Images as What They Are: Compound Matrix Regression for Image Classification', International Joint Conference on Artificial Intelligence.
Zheng, Shang, Yuan & Yang, Y. 2013, 'Towards Efficient Search for Activity Trajectories', IEEE Int'l Conf. on Data Engineering, Brisbane, Australia.
Cai, Y., Yang, Y., Hauptmann, A.G. & Wactlar, H.D. 2013, 'A cognitive assistive system for monitoring the use of home medical devices', MIIRH 2013 - Proceedings of the 1st ACM International Workshop on Multimedia Indexing and Information Retrieval for Heathcare, Co-located with ACM Multimedia 2013, pp. 59-66.
View/Download from: Publisher's site
Despite the popularity of home medical devices, serious safety concerns have been raised, because use-errors of home medical devices have been linked to a large number of fatal hazards. To resolve the problem, we introduce a cognitive assistive system to automatically monitor the use of home medical devices. Being able to accurately recognize user operations is one of the most important functionalities of the proposed system. However, even though various action recognition algorithms have been proposed in recent years, it is still unknown whether they are adequate for recognizing operations in using home medical devices. Since the lack of a corresponding database is the main reason for this situation, in the first part of this paper we present a database specially designed for studying the use of home medical devices. Then, we evaluate the performance of the existing approaches on the proposed database. Although we use state-of-the-art approaches which have demonstrated near-perfect performance in recognizing certain general human actions, we observe a significant performance drop when applying them to recognize device operations. We conclude that the tiny actions involved in using devices are one of the most important reasons for the performance decrease. To accurately recognize tiny actions, it is critical to focus on where the target action happens, namely the region of interest (ROI), and to build a more elaborate action model based on the ROI. Therefore, in the second part of this paper, we introduce a simple but effective approach to estimating the ROI for recognizing tiny actions. The key idea of this method is to analyze the correlation between an action and the sub-regions of a frame. The estimated ROI is then used as a filter for building more accurate action representations. Experimental results show significant performance improvements over the baseline methods by using the estimated ROI for action recognition. © 2013 ACM.
Yu, S.I., Xu, Z., Ding, D., Sze, W., Vicente, F., Lan, Z., Cai, Y., Rawat, S., Schulam, P., Markandaiah, N., Bahmani, S., Juarez, A., Tong, W., Yang, Y., Burger, S., Metze, F., Singh, R., Raj, B., Stern, R., Mitamura, T., Nyberg, E. & Hauptmann, A. 2012, 'Informedia e-lamp @ TRECVID 2012 multimedia event detection and recounting MED and MER', 2012 TREC Video Retrieval Evaluation Notebook Papers.
We report on our system used in the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. For MED, it consists of three main steps: extracting features, training detectors and fusion. In the feature extraction part, we extract many low-level, high-level, and text features. Those features are then represented in three different ways: spatial bag-of-words with standard tiling, spatial bag-of-words with feature- and event-specific tiling, and the Gaussian Mixture Model Super Vector. In detector training and fusion, two classifiers and three fusion methods are employed. The results from both the official sources and our internal evaluations show good performance of our system. Our MER system utilizes a subset of features and detection results from the MED system, from which the recounting is generated.
Ma, Z., Yang, Y., Cai, Y., Sebe, N. & Hauptmann, A.G. 2012, 'Knowledge adaptation for ad hoc multimedia event detection with few exemplars', MM 2012 - Proceedings of the 20th ACM International Conference on Multimedia, pp. 469-478.
View/Download from: Publisher's site
Multimedia event detection (MED) has a significant impact on many applications. Though video concept annotation has received much research effort, video event detection remains largely unaddressed. Current research mainly focuses on sports and news event detection or abnormality detection in surveillance videos. Our research on this topic is capable of detecting more complicated and generic events. Moreover, the curse of reality, i.e., that precisely labeled multimedia content is scarce, necessitates the study of how to attain respectable detection performance using only limited positive examples. Research addressing these two aforementioned issues is still in its infancy. In light of this, we explore Ad Hoc MED, which aims to detect complicated and generic events by using few positive examples. To the best of our knowledge, our work makes the first attempt on this topic. As the information from these few positive examples is limited, we propose to infer knowledge from other multimedia resources to facilitate event detection. Experiments are performed on real-world multimedia archives consisting of several challenging events. The results show that our approach outperforms several other detection algorithms. Most notably, our algorithm outperforms SVM by 43% and 14% in Average Precision when using the Gaussian and χ² kernels respectively. © 2012 ACM.
Yang, Y., Yang, Y., Huang, Z., Liu, J. & Ma, Z. 2012, 'Robust cross-media transfer for visual event detection', MM 2012 - Proceedings of the 20th ACM International Conference on Multimedia, pp. 1045-1048.
View/Download from: Publisher's site
In this paper, we present a novel approach, named Robust Cross-Media Transfer (RCMT), for visual event detection in social multimedia environments. Different from most existing methods, the proposed method can directly take different types of noisy social multimedia data as input and conduct robust event detection. More specifically, we build a robust model by employing an ℓ2,1-norm regression model featuring noise tolerance, and also manage to integrate different types of social multimedia data by minimizing the distribution difference among them. Experimental results on a real-life Flickr image dataset and a YouTube video dataset demonstrate the effectiveness of our proposal, compared to state-of-the-art algorithms. © 2012 ACM.
Li, Z., Yang, Y., Liu, J., Zhou, X. & Lu, H. 2012, 'Unsupervised feature selection using nonnegative spectral analysis', Proceedings of the National Conference on Artificial Intelligence, pp. 1026-1032.
In this paper, a new unsupervised learning algorithm, namely Nonnegative Discriminative Feature Selection (NDFS), is proposed. To exploit the discriminative information in unsupervised scenarios, we perform spectral clustering to learn the cluster labels of the input samples, during which feature selection is performed simultaneously. The joint learning of the cluster labels and the feature selection matrix enables NDFS to select the most discriminative features. To learn more accurate cluster labels, a nonnegative constraint is explicitly imposed on the class indicators. To reduce redundant or even noisy features, an ℓ2,1-norm minimization constraint is added to the objective function, which guarantees that the feature selection matrix is sparse in rows. Our algorithm exploits discriminative information and feature correlation simultaneously to select a better feature subset. A simple yet efficient iterative algorithm is designed to optimize the proposed objective function. Experimental results on different real-world datasets demonstrate the encouraging performance of our algorithm over the state-of-the-art methods. Copyright © 2012, Association for the Advancement of Artificial Intelligence. All rights reserved.
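The ℓ2,1-norm used here (and in several entries below) is simply the sum of the Euclidean norms of a matrix's rows; minimizing it pushes entire rows of the feature selection matrix to zero, which is what discards features. A two-line check:

    import numpy as np

    def l21_norm(W):
        # Sum of the Euclidean norms of the rows of W.
        return np.sqrt((W ** 2).sum(axis=1)).sum()

    W = np.array([[0.0, 0.0],   # zero row: this feature is dropped
                  [3.0, 4.0]])  # row norm 5: this feature is kept
    assert l21_norm(W) == 5.0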
Wang, S., Yang, Y., Ma, Z., Li, X., Pang, C. & Hauptmann, A.G. 2012, 'Action Recognition by Exploring Data Distribution and Feature Correlation', 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1370-1377.
Ma, Z., Yang, Y., Hauptmann, A.G. & Sebe, N. 2012, 'Classifier-specific intermediate representation for multimedia tasks', Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ICMR 2012.
View/Download from: Publisher's site
Video annotation and multimedia classification play important roles in many applications such as video indexing and retrieval. To improve video annotation and event detection, researchers have proposed using intermediate concept classifiers with concept lexica to help understand the videos. Yet it is difficult to judge how many and what concepts would be sufficient for the particular video analysis task. Additionally, obtaining robust semantic concept classifiers requires a large number of positive training examples, which in turn has high human annotation cost. In this paper, we propose an approach that is able to automatically learn an intermediate representation from video features together with a classifier. The joint optimization of the two components makes them mutually beneficial and reciprocal. Effectively, the intermediate representation and the classifier are tightly correlated. The classifier dependent intermediate representation not only accurately reflects the task semantics but is also more suitable for the specific classifier. Thus we have created a discriminative semantic analysis framework based on a tightly-coupled intermediate representation. Several experiments on video annotation and multimedia event detection using real-world videos demonstrate the effectiveness of the proposed approach. Copyright © 2012 ACM.
Cao, L., Ji, R., Gao, Y., Yang, Y. & Tian, Q. 2012, 'Weakly supervised sparse coding with geometric consistency pooling', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3578-3585.
View/Download from: Publisher's site
Most recently the Bag-of-Features (BoF) representation has been well advocated for image search and classification, with two decent phases named sparse coding and max pooling to compensate for quantization loss as well as inject spatial layouts. But still, much information is discarded by quantizing local descriptors with two-dimensional layouts into a one-dimensional BoF histogram. In this paper, we revisit this popular sparse coding and max pooling paradigm by looking around the local descriptor context towards an optimal BoF. First, we introduce a Weakly supervised Sparse Coding (WSC) to exploit Classemes-based attribute labeling to refine the descriptor coding procedure. It is achieved by learning an attribute-to-word co-occurrence prior to impose a label inconsistency distortion over the ℓ1-based coding regularizer, such that the descriptor codes can maximally preserve the image semantic similarity. Second, we propose an adaptive feature pooling scheme over superpixels rather than over fixed spatial pyramids, named Geometric Consistency Pooling (GCP). As an effect, local descriptors enjoying good geometric consistency are pooled together to ensure a more precise spatial layout embedding in BoF. Both of our phases are unsupervised, which differs from the existing works in supervised dictionary learning, sparse coding and feature pooling. Therefore, our approach enables potential applications like scalable visual search. We evaluate on both image classification and search benchmarks and report good improvements over the state-of-the-art. © 2012 IEEE.
Yang, Y., Hauptmann, A., Chen, M.Y., Cai, Y., Bharucha, A. & Wactlar, H. 2012, 'Learning to predict health status of geriatric patients from observational data', 2012 IEEE Symposium on Computational Intelligence and Computational Biology, CIBCB 2012, pp. 127-134.
View/Download from: Publisher's site
Data for diagnosis and clinical studies are now typically gathered by hand. While more detailed, exhaustive behavioral assessment scales have been developed, they have the drawback of being too time consuming, and manual assessment can be subjective. Besides, clinical knowledge is required for accurate manual assessment, for which extensive training is needed. Therefore, our great research challenge is to leverage machine learning techniques to better understand patients' health status automatically based on continuous computer observations. In this paper, we study the problem of health status prediction for geriatric patients using observational data. In the first part of this paper, we propose a distance metric learning algorithm to learn a Mahalanobis distance which is more precise for similarity measures. In the second part, we propose a robust classifier based on ℓ2,1-norm regression to predict the geriatric patients' health status. We test the algorithm on a dataset collected from a nursing home. Experiments show that our algorithm achieves encouraging performance. © 2012 IEEE.
Yang, Y. & Bao, C. 2012, 'A novel discriminant locality preserving projections for MDM-based speaker classification', 2012 THIRD GLOBAL CONGRESS ON INTELLIGENT SYSTEMS (GCIS 2012), 3rd Global Congress on Intelligent Systems (GCIS), IEEE, Wuhan, PEOPLES R CHINA, pp. 127-130.
View/Download from: Publisher's site
Shen, H.T., Shao, J., Huang, Z., Yang, Y., Song, J., Liu, J. & Zhu, X. 2011, 'UQMSG experiments for TRECVID 2011', 2011 TREC Video Retrieval Evaluation Notebook Papers.
This paper describes the experimental framework of the University of Queensland's Multimedia Search Group (UQMSG) at TRECVID 2011. We participated in two tasks this year, both for the first time. For the semantic indexing task, we submitted four lite runs: L_A_UQMSG1_1, L_A_UQMSG2_2, L_A_UQMSG3_3 and L_A_UQMSG4_4. They are all of training type A (actually we only used IACC.1.tv10.training data), but with different parameter settings in our keyframe-based Laplacian Joint Group Lasso (LJGL) algorithm with the Local Binary Patterns (LBP) feature. For the content-based copy detection task, we submitted two runs: UQMSG.m.nofa.mfh and UQMSG.m.balanced.mfh. They used only the video modality information of keyframes and were both based on our Multiple Feature Hashing (MFH) algorithm that fuses local (LBP) and global (HSV) visual features, with different application profiles (reducing the false alarm rate vs. balancing false alarms and misses). Due to time constraints, we were not able to adequately improve the performance of our systems on all the available training data this year for these tasks. Evaluation results suggest that more effort needs to be made to tune system parameters well. In addition, sophisticated techniques beyond applying keyframe-level semantic concept propagation and near-duplicate detection are required for achieving better performance in video tasks.
Yang, Y., Shen, H.T., Ma, Z., Huang, Z. & Zhou, X. 2011, 'ℓ2,1-Norm regularized discriminative feature selection for unsupervised learning', IJCAI International Joint Conference on Artificial Intelligence, pp. 1589-1594.
View/Download from: Publisher's site
Compared with supervised learning for feature selection, it is much more difficult to select the discriminative features in unsupervised learning due to the lack of label information. Traditional unsupervised feature selection algorithms usually select the features which best preserve the data distribution, e.g., the manifold structure, of the whole feature set. Under the assumption that the class label of input data can be predicted by a linear classifier, we incorporate discriminative analysis and ℓ2,1-norm minimization into a joint framework for unsupervised feature selection. Different from existing unsupervised feature selection algorithms, our algorithm selects the most discriminative feature subset from the whole feature set in batch mode. Extensive experiments on different data types demonstrate the effectiveness of our algorithm.
Yang, Y., Shen, H.T., Nie, F., Ji, R. & Zhou, X. 2011, 'Nonnegative spectral clustering with discriminative regularization', Proceedings of the National Conference on Artificial Intelligence, pp. 555-560.
Clustering is a fundamental research topic in the field of data mining. Optimizing the objective functions of clustering algorithms, e.g., normalized cut and k-means, is an NP-hard optimization problem. Existing algorithms usually relax the elements of the cluster indicator matrix from discrete values to continuous ones. Eigenvalue decomposition is then performed to obtain a relaxed continuous solution, which must be discretized. The main problem is that the signs of the relaxed continuous solution are mixed. Such results may deviate severely from the true solution, making it a nontrivial task to get the cluster labels. To address the problem, we impose an explicit nonnegative constraint for a more accurate solution during the relaxation. We additionally introduce a discriminative regularization into the objective to avoid overfitting. A new iterative approach is proposed to optimize the objective. We show that the algorithm is a general one which naturally leads to other extensions. Experiments demonstrate the effectiveness of our algorithm. Copyright © 2011, Association for the Advancement of Artificial Intelligence. All rights reserved.
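Read from the abstract alone, the relaxation can be rendered as the usual spectral objective with an explicit nonnegativity constraint on the cluster indicator matrix; the exact form of the discriminative regularizer is not given there, so it is left abstract in this hedged LaTeX sketch.

    % Hedged rendering: graph Laplacian L, cluster indicator F,
    % discriminative regularizer R(F) left unspecified.
    \min_{F^\top F = I,\; F \ge 0} \ \operatorname{tr}(F^\top L F) + \lambda\, R(F)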
Song, J., Yang, Y., Huang, Z., Shen, H.T. & Hong, R. 2011, 'Multiple feature hashing for real-time large scale near-duplicate video retrieval', MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, pp. 423-432.
View/Download from: Publisher's site
Near-duplicate video retrieval (NDVR) has recently attracted lots of research attention due to the exponential growth of online videos. It helps in many areas, such as copyright protection, video tagging, online video usage monitoring, etc. Most existing approaches use only a single feature to represent a video for NDVR. However, a single feature is often insufficient to characterize the video content. Besides, while accuracy is the main concern in the previous literature, the scalability of NDVR algorithms for large-scale video datasets has rarely been addressed. In this paper, we present a novel approach, Multiple Feature Hashing (MFH), to tackle both the accuracy and the scalability issues of NDVR. MFH preserves the local structure information of each individual feature and also globally considers the local structures of all the features to learn a group of hash functions which map the video keyframes into the Hamming space and generate a series of binary codes to represent the video dataset. We evaluate our approach on a public video dataset and a large-scale video dataset consisting of 132,647 videos, which we collected from YouTube. The experimental results show that the proposed method outperforms the state-of-the-art techniques in both accuracy and efficiency. © 2011 ACM.
Ma, Z., Yang, Y., Nie, F., Uijlings, J. & Sebe, N. 2011, 'Exploiting the entire feature space with sparsity for automatic image annotation', MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, pp. 283-292.
View/Download from: Publisher's site
The explosive growth of digital images requires effective methods to manage these images. Among various existing methods, automatic image annotation has proved to be an important technique for image management tasks, e.g., image retrieval over large-scale image databases. Automatic image annotation has been widely studied during recent years and a considerable number of approaches have been proposed. However, the performance of these methods is yet to be satisfactory, thus demanding further research on image annotation. In this paper, we propose a novel semi-supervised framework built upon feature selection for automatic image annotation. Our method aims to jointly select the most relevant features from all the data points by using a sparsity-based model and exploiting both labeled and unlabeled data to learn the manifold structure. Our framework is able to simultaneously learn a robust classifier for image annotation by selecting the discriminating features related to the semantic concepts. To solve the objective function of our framework, we propose an efficient iterative algorithm. Extensive experiments are performed on different real-world image datasets with the results demonstrating the promising performance of our framework for automatic image annotation. © 2011 ACM.
Yang, Y., Yang, Y., Huang, Z. & Shen, H.T. 2011, 'Transfer tagging from image to video', MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, pp. 1137-1140.
View/Download from: Publisher's site
Nowadays a massive amount of web video data is emerging on the Internet. To achieve effective and efficient video retrieval, it is critical to automatically assign semantic keywords to the videos via content analysis. However, most of the existing video tagging methods suffer from the problem of lacking sufficient tagged training videos due to the high labor cost of manual tagging. Inspired by the observation that there are many more well-labeled data in other yet relevant types of media (e.g., images), in this paper we study how to build a "cross-media tunnel" to transfer external tag knowledge from image to video. Meanwhile, the intrinsic data structures of both image and video spaces are well explored for inferring tags. We propose a Cross-Media Tag Transfer (CMTT) paradigm which is able to: 1) transfer tag knowledge between image and video by minimizing their distribution difference; 2) infer tags by revealing the underlying manifold structures embedded within both image and video spaces. We also learn an explicit mapping function to handle unseen videos. Experimental results have been reported and analyzed to illustrate the superiority of our proposal. Copyright 2011 ACM.
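One standard way to quantify the "distribution difference" that CMTT minimizes between image and video features is a maximum mean discrepancy; whether the paper uses exactly this form is not stated in the abstract, so the sketch below is an illustrative stand-in.

    import numpy as np

    def linear_mmd(A, B):
        # Squared MMD with a linear kernel: the distance between the
        # feature means of the two domains. Shrinks as domains align.
        return float(np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2))

    rng = np.random.default_rng(4)
    img_feats = rng.normal(0.0, 1.0, size=(400, 50))   # labeled images
    vid_feats = rng.normal(0.3, 1.0, size=(250, 50))   # unlabeled videos
    gap = linear_mmd(img_feats, vid_feats)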
Yang, Y., Yang, Y., Huang, Z., Shen, H.T. & Nie, F. 2011, 'Tag localization with spatial correlations and joint group sparsity', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 881-888.
View/Download from: Publisher's site
Numerous social images have been emerging on the Web. How to precisely label these images is critical to image retrieval. However, traditional image-level tagging methods may become less effective because global image matching approaches can hardly cope with the diversity and arbitrariness of Web image content. This raises an urgent need for fine-grained tagging schemes. In this work, we study how to establish a mapping between tags and image regions, i.e., localize tags to image regions, so as to better depict and index the content of images. We propose spatial group sparse coding (SGSC) by extending the robust encoding ability of group sparse coding with spatial correlations among training regions. We present spatial correlations in a two-dimensional image space and design group-specific spatial kernels to produce a more interpretable regularizer. Further, we propose a joint version of the SGSC model which is able to simultaneously encode a group of intrinsically related regions within a test image. An effective algorithm is developed to optimize the objective function of the Joint SGSC. The tag localization task is conducted by propagating tags from sparsely selected groups of regions to the target regions according to the reconstruction coefficients. Extensive experiments on three public image datasets illustrate that our proposed models achieve great performance improvements over the state-of-the-art method in the tag localization task. © 2011 IEEE.
Wang, H., Nie, F., Huang, H. & Yang, Y. 2011, 'Learning frame relevance for video classification', MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, pp. 1345-1348.
View/Download from: Publisher's site
Traditional video classification methods typically require a large number of labeled training video frames to achieve satisfactory performance. However, in the real world, we usually only have sufficient labeled video clips (such as tagged online videos) but lack labeled video frames. In this paper, we formalize the video classification problem as a Multi-Instance Learning (MIL) problem, an emerging topic in machine learning in recent years, which only needs bag (video clip) labels. To solve the problem, we propose a novel Parameterized Class-to-Bag (P-C2B) Distance method to learn the relative importance of a training instance with respect to its labeled classes, such that the instance level labeling ambiguity in MIL is tackled and the frame relevances of training video data with respect to the semantic concepts of interest are given. Promising experimental results have demonstrated the effectiveness of the proposed method. Copyright 2011 ACM.
Yang, Y., Nie, F., Xiang, S., Zhuang, Y. & Wang, W. 2010, 'Local and Global Regressive Mapping for manifold learning with out-of-sample extrapolation', Proceedings of the National Conference on Artificial Intelligence, pp. 649-654.
Over the past few years, a large family of manifold learning algorithms has been proposed and applied to various applications. While designing new manifold learning algorithms has attracted much research attention, fewer research efforts have been focused on out-of-sample extrapolation of the learned manifold. In this paper, we propose a novel manifold learning algorithm, namely Local and Global Regressive Mapping (LGRM), which employs local regression models to grasp the manifold structure. We additionally impose a global regression term as regularization to learn a model for out-of-sample data extrapolation. Based on the algorithm, we propose a new manifold learning framework. Our framework can be applied to any manifold learning algorithm to simultaneously learn the low-dimensional embedding of the training data and a model which provides an explicit mapping of out-of-sample data to the learned manifold. Experiments demonstrate that the proposed framework uncovers the manifold structure precisely and can be freely applied to unseen data. Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
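The out-of-sample ingredient can be pictured with a minimal sketch: once an embedding Y has been learned for the training points X, fit an explicit regression from X to Y so unseen points can be mapped without re-running the manifold learner. Ridge regression below stands in for the paper's global regression term; the random Y is a placeholder for a real learned embedding.

    import numpy as np

    def fit_out_of_sample_map(X, Y, lam=1e-2):
        # Closed-form ridge solution W = (X^T X + lam*I)^{-1} X^T Y.
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 20))   # training features
    Y = rng.normal(size=(500, 2))    # stand-in for a learned embedding
    W = fit_out_of_sample_map(X, Y)
    y_new = rng.normal(size=(1, 20)) @ W   # embed an unseen point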
Yang, Y., Zhuang, Y., Xu, D., Pan, Y., Tao, D. & Maybank, S. 2009, 'Retrieval Based Interactive Cartoon Synthesis via Unsupervised Bi-Distance Metric Learning', 2009 ACM International Conference on Multimedia Compilation E-Proceedings (with co-located workshops & symposiums), ACM international conference on Multimedia, Association for Computing Machinery, Inc. (ACM), Beijing, China, pp. 311-320.
View/Download from: UTS OPUS or Publisher's site
Cartoons play important roles in many areas, but it requires a lot of labor to produce new cartoon clips. In this paper, we propose a gesture recognition method for cartoon character images with two applications, namely content-based cartoon image retrieval and cartoon clip synthesis. We first define Edge Features (EF) and Motion Direction Features (MDF) for cartoon character images. The features are classified into two different groups, namely intra-features and inter-features. An Unsupervised Bi-Distance Metric Learning (UBDML) algorithm is proposed to recognize the gestures of cartoon character images. Different from the previous research efforts on distance metric learning, UBDML learns the optimal distance metric from the heterogeneous distance metrics derived from intra-features and inter-features. Content-based cartoon character image retrieval and cartoon clip synthesis can be carried out based on the distance metric learned by UBDML. Experiments show that the cartoon character image retrieval has a high precision and that the cartoon clip synthesis can be carried out efficiently.
Yang, Y., Xu, Nie, Luo & Zhuang 2009, 'Ranking with Local Regression and Global Alignment for Cross Media Retrieval', ACM Multimedia.
Yang, Y., Zhuang & Wang 2008, 'Heterogeneous Multimedia Data Semantics Mining using Content and Location Context', ACM Multimedia.
Zhuang, Y. & Yang, Y. 2007, 'Boosting cross-media retrieval by learning with positive and negative examples', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 165-174.
View/Download from: Publisher's site
Content-based cross-media retrieval is a new category of retrieval methods in which the modality of the query examples and the returned results need not be the same; for example, users may query images by an example of audio and vice versa. A Multimedia Document (MMD) is a set of media objects that are of different modalities but carry the same semantics. In this paper, a graph-based approach is proposed to achieve content-based cross-media retrieval and MMD retrieval. Positive and negative examples from relevance feedback are used differently to boost the retrieval performance, and experiments show that the proposed methods are very effective. © Springer-Verlag Berlin Heidelberg 2007.

Journal articles

Luo, M., Nie, F., Chang, X., Yang, Y., Hauptmann, A.G. & Zheng, Q. 2017, 'Avoiding Optimal Mean ℓ2,1-Norm Maximization-Based Robust PCA for Reconstruction', Neural Computation, vol. 29, no. 4, pp. 1124-1150.
View/Download from: Publisher's site
Robust principal component analysis (PCA) is one of the most important dimension-reduction techniques for handling high-dimensional data with outliers. However, most existing robust PCA presupposes that the mean of the data is zero and incorrectly utilizes the average of the data as the optimal mean of robust PCA. In fact, this assumption holds only for the squared ℓ2-norm-based traditional PCA. In this letter, we equivalently reformulate the objective of conventional PCA and learn the optimal projection directions by maximizing the sum of projected differences between each pair of instances based on the ℓ2,1-norm. The proposed method is robust to outliers and also invariant to rotation. More important, the reformulated objective not only automatically avoids the calculation of the optimal mean and makes the assumption of centered data unnecessary, but also theoretically connects to the minimization of reconstruction error. To solve the proposed nonsmooth problem, we exploit an efficient optimization algorithm to soften the contributions from outliers by reweighting each data point iteratively. We theoretically analyze the convergence and computational complexity of the proposed algorithm. Extensive experimental results on several benchmark data sets illustrate the effectiveness and superiority of the proposed method.
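From the abstract's description, the reformulated objective maximizes the sum of ℓ2 norms of projected pairwise differences; a plausible LaTeX rendering (the orthonormality constraint on the projections is an assumption) is:

    % W: projection directions; x_i: data points. Maximizing projected
    % pairwise differences removes any dependence on an explicit mean.
    \max_{W^\top W = I} \ \sum_{i<j} \left\lVert W^\top (x_i - x_j) \right\rVert_2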
Nie, L., Wei, X., Zhang, D., Wang, X., Gao, Z. & Yang, Y. 2017, 'Data-driven answer selection in community QA systems', IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 6, pp. 1186-1198.
View/Download from: Publisher's site
© 1989-2012 IEEE. Finding similar questions from historical archives has been applied to question answering, with solid theoretical underpinnings and great practical success. Nevertheless, each question in the returned candidate pool often associates with multiple answers, and hence users have to painstakingly browse a lot before finding the correct one. To alleviate this problem, we present a novel scheme to rank answer candidates via pairwise comparisons. In particular, it consists of one offline learning component and one online search component. In the offline learning component, we first automatically establish the positive, negative, and neutral training samples in terms of preference pairs, guided by our data-driven observations. We then present a novel model to jointly incorporate these three types of training samples. The closed-form solution of this model is derived. In the online search component, we first collect a pool of answer candidates for the given question via finding its similar questions. We then sort the answer candidates by leveraging the offline trained model to judge the preference orders. Extensive experiments on real-world vertical and general community-based question answering datasets have comparatively demonstrated its robustness and promising performance. Also, we have released the codes and data to facilitate other researchers.
Chang, X., Ma, Z., Yang, Y., Zeng, Z. & Hauptmann, A.G. 2017, 'Bi-Level Semantic Representation Analysis for Multimedia Event Detection', IEEE Transactions on Cybernetics, vol. 47, no. 5, pp. 1180-1197.
View/Download from: Publisher's site
Chang, X., Ma, Z., Lin, M., Yang, Y. & Hauptmann, A.G. 2017, 'Feature Interaction Augmented Sparse Learning for Fast Kinect Motion Detection', IEEE Transactions on Image Processing, vol. 26, no. 8, pp. 3911-3920.
View/Download from: Publisher's site
Ma, Z., Chang, X., Yang, Y., Sebe, N. & Hauptmann, A.G. 2017, 'The Many Shades of Negativity', IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1558-1568.
View/Download from: Publisher's site
Nie, L., Zhang, L., Meng, L., Song, X., Chang, X. & Li, X. 2017, 'Modeling Disease Progression via Multisource Multitask Learners: A Case Study With Alzheimer's Disease', IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 7, pp. 1508-1519.
View/Download from: Publisher's site
Luo, M., Nie, F., Chang, X., Yang, Y., Hauptmann, A.G. & Zheng, Q. 2017, 'Adaptive Unsupervised Feature Selection With Structure Regularization', IEEE Transactions on Neural Networks and Learning Systems, pp. 1-13.
View/Download from: Publisher's site
Chang, X., Nie, F., Wang, S., Yang, Y., Zhou, X. & Zhang, C. 2016, 'Compound Rank-k Projections for Bilinear Analysis', IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 7, pp. 1502-1513.
In many real-world applications, data are represented by matrices or high-order tensors. Despite the promising performance, the existing two-dimensional discriminant analysis algorithms employ a single projection model to exploit the discriminant information for projection, making the model less flexible. In this paper, we propose a novel Compound Rank-k Projection (CRP) algorithm for bilinear analysis. CRP deals with matrices directly without transforming them into vectors, and it therefore preserves the correlations within the matrix and decreases the computation complexity. Different from the existing two-dimensional discriminant analysis algorithms, the objective function values of CRP increase monotonically. In addition, CRP utilizes multiple rank-k projection models to enable a larger search space in which the optimal solution can be found. In this way, the discriminant ability is enhanced.
Chang, X. & Yang, Y. 2016, 'Semisupervised Feature Analysis by Mining Correlations Among Multiple Tasks', IEEE Transactions on Neural Networks and Learning Systems, pp. 1-12.
View/Download from: Publisher's site
Zhang, L., Li, X., Nie, L., Yang, Y. & Xia, Y. 2016, 'Weakly Supervised Human Fixations Prediction', IEEE Transactions on Cybernetics, vol. 46, no. 1, pp. 258-269.
View/Download from: Publisher's site
Automatically predicting human eye fixations is a useful technique that can facilitate many multimedia applications, e.g., image retrieval, action recognition, and photo retargeting. Conventional approaches are frustrated by two drawbacks. First, psychophysical experiments show that an object-level interpretation of scenes influences eye movements significantly. Most of the existing saliency models rely on object detectors, and therefore only a few prespecified categories can be discovered. Second, the relative displacement of objects influences their saliency remarkably, but current models cannot describe it explicitly. To solve these problems, this paper proposes weakly supervised fixations prediction, which leverages image labels to improve the accuracy of human fixations prediction. The proposed model hierarchically discovers objects as well as their spatial configurations. Starting from the raw image pixels, we sample superpixels in an image, whereby seamless object descriptors termed object-level graphlets (oGLs) are generated by random walking on the superpixel mosaic. Then, a manifold embedding algorithm is proposed to encode image labels into oGLs, and the response map of each prespecified object is computed accordingly. On the basis of the object-level response map, we propose spatial-level graphlets (sGLs) to model the relative positions among objects. Afterward, eye tracking data is employed to integrate these sGLs for predicting human eye fixations. Thorough experimental results demonstrate the advantage of the proposed method over the state-of-the-art.
Chen, L., Li, X., Yang, Y., Kurniawati, H., Sheng, Q.Z., Hu, H.Y. & Huang, N. 2016, 'Personal health indexing based on medical examinations: A data mining approach', Decision Support Systems, vol. 81, pp. 54-65.
View/Download from: Publisher's site
We design a method called MyPHI that predicts personal health index (PHI), a new evidence-based health indicator to explore the underlying patterns of a large collection of geriatric medical examination (GME) records using data mining techniques. We define PHI as a vector of scores, each reflecting the health risk in a particular disease category. The PHI prediction is formulated as an optimization problem that finds the optimal soft labels as health scores based on medical records that are infrequent, incomplete, and sparse. Our method is compared with classification models commonly used in medical applications. The experimental evaluation has demonstrated the effectiveness of our method based on a real-world GME data set collected from 102,258 participants.
Gan, C., Yang, Y., Zhu, L., Zhao, D. & Zhuang, Y. 2016, 'Recognizing an Action Using Its Name: A Knowledge-Based Approach', International Journal of Computer Vision, vol. 120, no. 1, pp. 61-77.
View/Download from: Publisher's site
© 2016, Springer Science+Business Media New York. Existing action recognition algorithms require a set of positive exemplars to train a classifier for each action. However, the number of action classes is very large and users' queries vary dramatically. It is impractical to pre-define all possible action classes beforehand. To address this issue, we propose to perform action recognition with no positive exemplars, which is often known as zero-shot learning. Current zero-shot learning paradigms usually train a series of attribute classifiers and then recognize the target actions based on the attribute representation. To ensure the maximum coverage of ad-hoc action classes, the attribute-based approaches require large numbers of reliable and accurate attribute classifiers, which are often unavailable in the real world. In this paper, we propose an approach that merely takes an action name as the input to recognize the action of interest, without any pre-trained attribute classifiers and positive exemplars. Given an action name, we first build an analogy pool according to an external ontology, and each action in the analogy pool is related to the target action at a different level. The correlation information inferred from the external ontology may be noisy. We then propose an algorithm, namely adaptive multi-model rank-preserving mapping (AMRM), to train a classifier for action recognition, which is able to evaluate the relatedness of each video in the analogy pool adaptively. As multiple mapping models are employed, our algorithm has better capability to bridge the gap between visual features and the semantic information inferred from the ontology. Extensive experiments demonstrate that our method achieves promising performance for action recognition using only action names, while no attributes and positive exemplars are available.
Yu, S.I., Xu, S., Ma, Z., Li, H., Hauptmann, A.G., Chang, X., Yang, Y., Meng, D., Lin, M., Lan, Z., Gan, C., Xu, Z., Mao, Z., Li, X., Jiang, L. & Du, X. 2016, 'Strategies for searching video content with text queries or video examples', ITE Transactions on Media Technology and Applications, vol. 4, no. 3, pp. 227-238.
© 2016 by ITE Transactions on Media Technology and Applications (MTA). The large number of user-generated videos uploaded on to the Internet everyday has led to many commercial video search engines, which mainly rely on text metadata for search. However, metadata is often lacking for user-generated videos, thus these videos are unsearchable by current search engines. Therefore, content-based video retrieval (CBVR) tackles this metadata-scarcity problem by directly analyzing the visual and audio streams of each video. CBVR encompasses multiple research topics, including low-level feature design, feature fusion, semantic detector training and video search/reranking. We present novel strategies in these topics to enhance CBVR in both accuracy and speed under different query inputs, including pure textual queries and query by video examples. Our proposed strategies have been incorporated into our submission for the TRECVID 2014 Multimedia Event Detection evaluation, where our system outperformed other submissions in both text queries and video example queries, thus demonstrating the effectiveness of our proposed approaches.
Yan, Y., Nie, F., Li, W., Gao, C., Yang, Y. & Xu, D. 2016, 'Image Classification by Cross-Media Active Learning with Privileged Information', IEEE Transactions on Multimedia, vol. 18, no. 12, pp. 2494-2502.
View/Download from: Publisher's site
© 2016 IEEE. In this paper, we propose a novel cross-media active learning algorithm to reduce the effort on labeling images for training. The Internet images are often associated with rich textual descriptions. Even though such textual information is not available in test images, it is still useful for learning robust classifiers. In light of this, we apply the recently proposed supervised learning paradigm, learning using privileged information, to the active learning task. Specifically, we train classifiers on both visual features and privileged information, and measure the uncertainty of unlabeled data by exploiting the learned classifiers and slacking function. Then, we propose to select unlabeled samples by jointly measuring the cross-media uncertainty and the visual diversity. Our method automatically learns the optimal tradeoff parameter between the two measurements, which in turn makes our algorithms particularly suitable for real-world applications. Extensive experiments demonstrate the effectiveness of our approach.
Xia, Y., Nie, L., Zhang, L., Yang, Y., Hong, R. & Li, X. 2016, 'Weakly Supervised Multilabel Clustering and its Applications in Computer Vision', IEEE Transactions on Cybernetics, vol. 46, no. 12, pp. 3220-3232.
View/Download from: Publisher's site
© 2016 IEEE. Clustering is a useful statistical tool in computer vision and machine learning. It is generally accepted that introducing supervised information brings remarkable performance improvement to clustering. However, assigning accurate labels is expensive when the amount of training data is huge. Existing supervised clustering methods handle this problem by transferring the bag-level labels into the instance-level descriptors. However, the assumption that each bag has a single label limits the application scope seriously. In this paper, we propose weakly supervised multilabel clustering, which allows assigning multiple labels to a bag. Based on this, the instance-level descriptors can be clustered with the guidance of bag-level labels. The key technique is a weakly supervised random forest that infers the model parameters. Thereby, a deterministic annealing strategy is developed to optimize the nonconvex objective function. The proposed algorithm is efficient in both the training and the testing stages. We apply it to three popular computer vision tasks: 1) image clustering; 2) semantic image segmentation; and 3) multiple objects localization. Impressive performance on the state-of-the-art image data sets is achieved in our experiments.
Zhang, L., Yang, Y., Nie, F. & Shao, L. 2016, 'Guest editors' introduction: Perception, Aesthetics, and Emotion in Multimedia Quality Modeling', IEEE MultiMedia, vol. 23, no. 3, pp. 20-22.
View/Download from: Publisher's site
Han, Y., Yang, Y. & Zhou, X. 2016, 'Guest editorial: web multimedia semantic inference using multi-cues', WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, vol. 19, no. 2, pp. 177-179.
View/Download from: Publisher's site
Chang, X., Nie, F., Yang, Y., Zhang, C. & Huang, H. 2016, 'Convex Sparse PCA for Unsupervised Feature Learning', ACM Transactions on Knowledge Discovery from Data, vol. 11, no. 1, pp. 1-16.
View/Download from: UTS OPUS or Publisher's site
Principal component analysis (PCA) has been widely applied to dimensionality reduction and data preprocessing for different applications in engineering, biology, social science, and the like. Classical PCA and its variants seek linear projections of the original variables to obtain low-dimensional feature representations with maximal variance. One limitation is that it is difficult to interpret the results of PCA. Besides, classical PCA is vulnerable to certain noisy data. In this paper, we propose a Convex Sparse Principal Component Analysis (CSPCA) algorithm and apply it to feature learning. First, we show that PCA can be formulated as a low-rank regression optimization problem. Based on this discussion, ℓ2,1-norm minimization is incorporated into the objective function to make the regression coefficients sparse, and thereby robust to outliers. Also, based on the sparse model used in CSPCA, an optimal weight is assigned to each of the original features, which in turn provides the output with good interpretability. With the output of our CSPCA, we can effectively analyze the importance of each feature under the PCA criteria. Our new objective function is convex, and we propose an iterative algorithm to optimize it. We apply the CSPCA algorithm to feature selection and conduct extensive experiments on seven benchmark datasets. Experimental results demonstrate the promising performance of the proposed algorithm.
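A hedged reading of the formulation this abstract outlines: PCA written as low-rank regression (reconstruction through a rank-k factorization), with an ℓ2,1 penalty making the coefficient rows sparse so that per-feature weights emerge. The exact factorization and the convexification step are not given in the abstract and are assumptions in this LaTeX sketch.

    % X: data matrix; A, B: rank-k factors; row-sparse B selects features.
    \min_{A, B} \ \lVert X - A B^\top X \rVert_F^2 + \lambda \lVert B \rVert_{2,1}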
Wu, F., Fang, H., Li, X., Tang, S., Lu, W., Yang, Y., Zhu, W. & Zhuang, Y. 2016, 'Aspect Learning for Multimedia Summarization via Nonparametric Bayesian', IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, vol. 26, no. 10, pp. 1931-1942.
View/Download from: Publisher's site
Han, Y. & Yang, Y. 2016, 'Guest editorial: Adaptation methods for multimedia analysis', Neurocomputing, vol. 173, pp. 81-82.
View/Download from: Publisher's site
Chang, X., Yu, Y.-L., Yang, Y. & Xing, E.P. 2016, 'Semantic Pooling for Complex Event Analysis in Untrimmed Videos', IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1.
View/Download from: Publisher's site
Chang, X., Nie, F., Ma, Z., Yang, Y. & Zhou, X. 2015, 'A convex formulation for spectral shrunk clustering', Proceedings of the National Conference on Artificial Intelligence, vol. 4, pp. 2532-2538.
Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Spectral clustering is a fundamental technique in the field of data mining and information processing. Most existing spectral clustering algorithms integrate dimensionality reduction into the clustering process, assisted by manifold learning in the original space. However, the manifold in the reduced-dimensional subspace is likely to exhibit altered properties in contrast with the original space. Thus, applying manifold information obtained from the original space to the clustering process in a low-dimensional subspace is prone to inferior performance. Aiming to address this issue, we propose a novel convex algorithm that mines the manifold structure in the low-dimensional subspace. In addition, our unified learning process makes the manifold learning particularly tailored for the clustering. Compared with other related methods, the proposed algorithm results in a more structured clustering result. To validate the efficacy of the proposed algorithm, we perform extensive experiments on several benchmark datasets in comparison with some state-of-the-art clustering approaches. The experimental results demonstrate that the proposed algorithm has quite promising clustering performance.
Yang, Y., Ma, Z., Nie, F., Chang, X. & Hauptmann, A.G. 2015, 'Multi-Class Active Learning by Uncertainty Sampling with Diversity Maximization', International Journal of Computer Vision, vol. 113, no. 2.
View/Download from: Publisher's site
As a way to relieve the tedious work of manual annotation, active learning plays important roles in many applications of visual concept recognition. In typical active learning scenarios, the number of labelled data in the seed set is usually small. However, most existing active learning algorithms only exploit the labelled data, which often suffers from over-fitting due to the small number of labelled examples. Besides, while much progress has been made in binary class active learning, little research attention has been focused on multi-class active learning. In this paper, we propose a semi-supervised batch mode multi-class active learning algorithm for visual concept recognition. Our algorithm exploits the whole active pool to evaluate the uncertainty of the data. Considering that uncertain data are always similar to each other, we propose to make the selected data as diverse as possible, for which we explicitly impose a diversity constraint on the objective function. As a multi-class active learning algorithm, our algorithm is able to exploit uncertainty across multiple classes. An efficient algorithm is used to optimize the objective function. Extensive experiments on action recognition, object classification, scene recognition, and event detection demonstrate its advantages.
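The two ingredients named in this abstract, uncertainty and diversity, can be pictured with a generic batch selector: rank the pool by predictive entropy, then greedily keep points that stay far from those already chosen. This illustrates the idea only; per the abstract, the paper imposes the diversity constraint directly on a joint objective rather than using this greedy heuristic, and the probabilities and threshold below are hypothetical.

    import numpy as np

    def entropy(P):
        # Row-wise entropy of a class-probability matrix P (n x classes).
        P = np.clip(P, 1e-12, 1.0)
        return -(P * np.log(P)).sum(axis=1)

    def select_batch(P, X, k, min_dist=1.0):
        # Most uncertain first, skipping points too close to the batch.
        order = np.argsort(-entropy(P))
        chosen = [order[0]]
        for i in order[1:]:
            if len(chosen) == k:
                break
            if np.linalg.norm(X[i] - X[chosen], axis=1).min() > min_dist:
                chosen.append(i)
        return chosen

    rng = np.random.default_rng(3)
    P = rng.dirichlet(np.ones(5), size=300)   # hypothetical posteriors
    X = rng.normal(size=(300, 10))
    batch = select_batch(P, X, k=10)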
Yan, Y., Yang, Y., Meng, D., Liu, G., Tong, W., Hauptmann, A.G. & Sebe, N. 2015, 'Event Oriented Dictionary Learning for Complex Event Detection', IEEE Transactions on Image Processing, vol. 24, no. 6, pp. 1867-1878.
View/Download from: Publisher's site
© 1992-2012 IEEE. Complex event detection is a retrieval task with the goal of finding videos of a particular event in a large-scale unconstrained Internet video archive, given example videos and text descriptions. Nowadays, different multimodal fusion schemes of low-level and high-level features are extensively investigated and evaluated for the complex event detection task. However, how to effectively select the high-level semantically meaningful concepts from a large pool to assist complex event detection is rarely studied in the literature. In this paper, we propose a novel strategy to automatically select semantically meaningful concepts for the event detection task based on both the event-kit text descriptions and the concepts' high-level feature descriptions. Moreover, we introduce a novel event-oriented dictionary representation based on the selected semantic concepts. Toward this goal, we leverage training images (frames) of selected concepts from the semantic indexing dataset, with a pool of 346 concepts, into a novel supervised multitask ℓp-norm dictionary learning framework. Extensive experimental results on the TRECVID multimedia event detection dataset demonstrate the efficacy of our proposed method.
Yang, Y., Ma, Z., Yang, Y., Nie, F. & Shen, H.T. 2015, 'Multitask spectral clustering by exploring intertask correlation', IEEE Transactions on Cybernetics, vol. 45, no. 5, pp. 1069-1080.
View/Download from: Publisher's site
© 2014 IEEE. Clustering, as one of the most classical research problems in pattern recognition and data mining, has been widely explored and applied to various applications. Due to the rapid evolution of data on the Web, more emerging challenges have been posed on traditional clustering techniques: 1) correlations among related clustering tasks and/or within individual tasks are not well captured; 2) the problem of clustering out-of-sample data is seldom considered; and 3) the discriminative property of the cluster label matrix is not well explored. In this paper, we propose a novel clustering model, namely multitask spectral clustering (MTSC), to cope with the above challenges. Specifically, two types of correlations are well considered: 1) intertask clustering correlation, which refers to the relations among different clustering tasks, and 2) intratask learning correlation, which enables the processes of learning cluster labels and learning the mapping function to reinforce each other. We incorporate a novel ℓ2,p-norm regularizer to control the coherence of all the tasks based on the assumption that related tasks should share a common low-dimensional representation. Moreover, for each individual task, an explicit mapping function is simultaneously learnt for predicting cluster labels by mapping features to the cluster label matrix. Meanwhile, we show that the learning process can naturally incorporate discriminative information to further improve clustering performance. We explore and discuss the relationships between our proposed model and several representative clustering techniques, including spectral clustering, k-means and discriminative k-means. Extensive experiments on various real-world datasets illustrate the advantage of the proposed MTSC model compared to state-of-the-art clustering approaches.
Han, Y., Yang, Y., Yan, Y., Ma, Z., Sebe, N. & Zhou, X. 2015, 'Semisupervised feature selection via spline regression for video semantic recognition', IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 2, pp. 252-264.
View/Download from: Publisher's site
© 2014 IEEE. To improve both the efficiency and accuracy of video semantic recognition, we can perform feature selection on the extracted video features to select a subset of features from the high-dimensional feature set for a compact and accurate video data representation. Provided the number of labeled videos is small, supervised feature selection could fail to identify the relevant features that are discriminative to target classes. In many applications, abundant unlabeled videos are easily accessible. This motivates us to develop semisupervised feature selection algorithms to better identify the relevant video features, which are discriminative to target classes, by effectively exploiting the information underlying the huge amount of unlabeled video data. In this paper, we propose a framework of video semantic recognition by semisupervised feature selection via spline regression (S²FS²R). Two scatter matrices are combined to capture both the discriminative information and the local geometry structure of labeled and unlabeled training videos: a within-class scatter matrix encoding discriminative information of labeled training videos and a spline scatter output from a local spline regression encoding data distribution. An ℓ2,1-norm is imposed as a regularization term on the transformation matrix to ensure it is sparse in rows, making it particularly suitable for feature selection. To efficiently solve S²FS²R, we develop an iterative algorithm and prove its convergence. In the experiments, three typical tasks of video semantic recognition, video concept detection, video classification, and human action recognition, are used to demonstrate that the proposed S²FS²R achieves better performance compared with the state-of-the-art methods.
Han, Y., Yang, Y., Wu, F. & Hong, R. 2015, 'Compact and Discriminative Descriptor Inference Using Multi-Cues', IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5114-5126.
View/Download from: Publisher's site
© 1992-2012 IEEE. Feature descriptors around local interest points are widely used in human action recognition, both for images and videos. However, each kind of descriptor describes the local characteristics around the reference point from only one cue. To enhance the descriptive and discriminative ability from multiple cues, this paper proposes a descriptor learning framework to optimize the descriptors at the source by learning a projection from multiple descriptors' spaces to a new Euclidean space. In this space, multiple cues and characteristics of different descriptors are fused and complement each other. In order to make the new descriptor more discriminative, we learn the multi-cue projection by minimizing the ratio of within-class scatter to between-class scatter, and therefore the discriminative ability of the projected descriptor is enhanced. In the experiments, we evaluate our framework on the tasks of action recognition from still images and videos. Experimental results on two benchmark image and two benchmark video data sets demonstrate the effectiveness and better performance of our method.
Han, Y., Yang, Y. & Wang, J. 2015, 'Guest Editorial: Ad Hoc Web Multimedia Analysis with Limited Supervision', MULTIMEDIA TOOLS AND APPLICATIONS, vol. 74, no. 2, pp. 463-465.
View/Download from: Publisher's site
Ma, Z., Yang, Y., Sebe, N. & Hauptmann, A.G. 2014, 'Knowledge adaptation with partially shared features for event detection using few exemplars', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 9, pp. 1789-1802.
View/Download from: Publisher's site
Multimedia event detection (MED) is an emerging area of research. Previous work mainly focuses on simple event detection in sports and news videos, or abnormality detection in surveillance videos. In contrast, we focus on detecting more complicated and generic events that gain more users' interest, and we explore an effective solution for MED. Moreover, our solution only uses few positive examples, since precisely labeled multimedia content is scarce in the real world. As the information from these few positive examples is limited, we propose using knowledge adaptation to facilitate event detection. Different from the state of the art, our algorithm is able to adapt knowledge from another source for MED even if the features of the source and the target are partially different, but overlapping. Avoiding the requirement that the two domains are consistent in feature types is desirable, as data collection platforms change or augment their capabilities and we should be able to respond to this with little or no effort. We perform extensive experiments on real-world multimedia archives consisting of several challenging events. The results show that our approach outperforms several other state-of-the-art detection algorithms. © 1979-2012 IEEE.
Song, J., Yang, Y., Li, X., Huang, Z. & Yang, Y. 2014, 'Robust Hashing with local models for approximate similarity search', IEEE Transactions on Cybernetics, vol. 44, no. 7, pp. 1225-1236.
View/Download from: Publisher's site
Similarity search plays an important role in many applications involving high-dimensional data. Due to the known dimensionality curse, the performance of most existing indexing structures degrades quickly as the feature dimensionality increases. Hashing methods, such as locality sensitive hashing (LSH) and its variants, have been widely used to achieve fast approximate similarity search by trading search quality for efficiency. However, most existing hashing methods make use of randomized algorithms to generate hash codes without considering the specific structural information in the data. In this paper, we propose a novel hashing method, namely, robust hashing with local models (RHLM), which learns a set of robust hash functions to map the high-dimensional data points into binary hash codes by effectively utilizing local structural information. In RHLM, for each individual data point in the training dataset, a local hashing model is learned and used to predict the hash codes of its neighboring data points. The local models from all the data points are globally aligned so that an optimal hash code can be assigned to each data point. After obtaining the hash codes of all the training data points, we design a robust method by employing ℓ2,1-norm minimization on the loss function to learn effective hash functions, which are then used to map each database point into its hash code. Given a query data point, the search process first maps it into the query hash code by the hash functions and then explores the buckets, which have similar hash codes to the query hash code. Extensive experimental results conducted on real-life datasets show that the proposed RHLM outperforms the state-of-the-art methods in terms of search quality and efficiency. © 2013 IEEE.
Ma, Z., Yang, Y., Nie, F., Sebe, N., Yan, S. & Hauptmann, A.G. 2014, 'Harnessing Lab Knowledge for Real-World Action Recognition', International Journal of Computer Vision, vol. 109, no. 1-2, pp. 60-73.
View/Download from: Publisher's site
Han, Y., Yang, Y., Ma, Z., Shen, H., Sebe, N. & Zhou, X. 2014, 'Image attribute adaptation', IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 1115-1126.
View/Download from: Publisher's site
Visual attributes can be considered as a middle-level semantic cue that bridges the gap between low-level image features and high-level object classes. Thus, attributes have the advantage of transcending specific semantic categories or describing objects across categories. Since attributes are often human-nameable and domain specific, much work constructs attribute annotations ad hoc or takes them from an application-dependent ontology. To facilitate other applications with attributes, it is necessary to develop methods which can adapt a well-defined set of attributes to novel images. In this paper, we propose a framework for image attribute adaptation. The goal is to automatically adapt the knowledge of attributes from a well-defined auxiliary image set to a target image set, thus assisting in predicting appropriate attributes for target images. In the proposed framework, we use a non-linear mapping function corresponding to multiple base kernels to map each training image of both the auxiliary and the target sets to a Reproducing Kernel Hilbert Space (RKHS), where we reduce the mismatch of data distributions between auxiliary and target images. In order to make use of unlabeled images, we incorporate a semi-supervised learning process. We also introduce a robust loss function into our framework to remove the shared irrelevance and noise of training images. Experiments on two pairs of auxiliary-target image sets demonstrate that the proposed framework achieves better performance in predicting attributes for target testing images, compared to three baselines and two state-of-the-art domain adaptation methods. © 2014 IEEE.
Tong, W., Yang, Y., Jiang, L., Yu, S.I., Lan, Z., Ma, Z., Sze, W., Younessian, E. & Hauptmann, A.G. 2014, 'E-LAMP: Integration of innovative ideas for multimedia event detection', Machine Vision and Applications, vol. 25, no. 1, pp. 5-15.
View/Download from: Publisher's site
Detecting multimedia events in web videos is an emerging hot research area in the fields of multimedia and computer vision. In this paper, we introduce the core methods and technologies of the framework we developed recently for our Event Labeling through Analytic Media Processing (E-LAMP) system to deal with different aspects of the overall problem of event detection. More specifically, first, we have developed efficient methods for feature extraction so that we are able to handle large collections of video data with thousands of hours of video. Second, we represent the extracted raw features in a spatial bag-of-words model with more effective tilings such that the spatial layout information of different features and different events can be better captured, thus the overall detection performance can be improved. Third, different from widely used early and late fusion schemes, a novel algorithm is developed to learn a more robust and discriminative intermediate feature representation from multiple features so that better event models can be built upon it. Finally, to tackle the additional challenge of event detection with only very few positive exemplars, we have developed a novel algorithm which is able to effectively adapt the knowledge learnt from auxiliary sources to assist the event detection. Both our empirical results and the official evaluation results on TRECVID MED'11 and MED'12 demonstrate the excellent performance of the integration of these ideas. © 2013 Springer-Verlag Berlin Heidelberg.
Wu, F., Yu, Z., Yang, Y., Tang, S., Zhang, Y. & Zhuang, Y. 2014, 'Sparse Multi-Modal Hashing', IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 427-439.
View/Download from: Publisher's site
Zhang, L., Yang, Y., Gao, Y., Yu, Y., Wang, C. & Li, X. 2014, 'A probabilistic associative model for segmenting weakly supervised images', IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 4150-4159.
View/Download from: Publisher's site
Weakly supervised image segmentation is an important yet challenging task in image processing and pattern recognition fields. It is defined as follows: in the training stage, semantic labels are provided only at the image level, without regard to their specific object/scene location within the image. Given a test image, the goal is to predict the semantics of every pixel/superpixel. In this paper, we propose a new weakly supervised image segmentation model, focusing on learning the semantic associations between superpixel sets (graphlets in this paper). In particular, we first extract graphlets from each image, where a graphlet is a small-sized graph that measures the potential of multiple spatially neighboring superpixels (i.e., the probability of these superpixels sharing a common semantic label, such as the sky or the sea). To compare different-sized graphlets and to incorporate image-level labels, a manifold embedding algorithm is designed to transform all graphlets into equal-length feature vectors. Finally, we present a hierarchical Bayesian network to capture the semantic associations between post-embedding graphlets, based on which the semantics of each superpixel is inferred accordingly. Experimental results demonstrate that: 1) our approach performs competitively compared with the state-of-the-art approaches on three public data sets and 2) considerable performance enhancement is achieved when using our approach on segmentation-based photo cropping and image categorization. © 2014 IEEE.
Han, Y., Wei, X., Cao, X., Yang, Y. & Zhou, X. 2014, 'Augmenting image descriptions using structured prediction output', IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1665-1676.
View/Download from: Publisher's site
© 2014 IEEE. The need for richer descriptions of images arises in a wide spectrum of applications ranging from image understanding to image retrieval. While Automatic Image Annotation (AIA) has been extensively studied, the image descriptions given by its output labels lack sufficient information. This paper proposes to augment image descriptions using structured prediction output. We define a hierarchical tree-structured semantic unit to describe images, from which we can obtain not only the class and subclass an image belongs to but also the attributes it has. After defining a new feature map function of structured SVM, we decompose the loss function into every node of the hierarchical tree-structured semantic unit and then predict the tree-structured semantic unit for testing images. In the experiments, we evaluate the performance of the proposed method on two open benchmark datasets and compare with the state-of-the-art methods. Experimental results show the superior prediction performance of the proposed method and demonstrate the value of augmenting image descriptions.
Liu, J., Yang, Y., Huang, Z., Yang, Y. & Shen, H.T. 2014, 'On the influence propagation of web videos', IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 8, pp. 1961-1973.
View/Download from: Publisher's site
We propose a novel approach to analyze how a popular video is propagated in the cyberspace, to identify whether it originated from a certain sharing site, and to identify how it reached its current popularity during propagation. In addition, we estimate its influence across different websites outside the major hosting website. Web video is gaining significance due to its rich and eye-catching content, a phenomenon evidently amplified and accelerated by the advance of Web 2.0. When a video receives some degree of popularity, it tends to appear on various websites, including not only video-sharing websites but also news websites, social networks, or even Wikipedia. Numerous video-sharing websites have hosted videos that reached a phenomenal level of visibility and popularity in the entire cyberspace. As a result, it is becoming more difficult to determine how the propagation took place: was the video a piece of original work intentionally uploaded to its major hosting site by the authors, did it originate from some small site and reach the sharing site after already gaining a good level of popularity, or did it originate elsewhere in the cyberspace with the sharing site making it popular? Existing studies of this flow of influence are lacking, and literature discussing the problem of estimating a video's influence in the whole cyberspace also remains rare. In this article, we introduce a novel framework to identify the propagation of popular videos from the major hosting site's perspective and to estimate their influence. We define a Unified Virtual Community Space (UVCS) to model the propagation and influence of a video, and devise a novel learning method called Noise-reductive Local-and-Global Learning (NLGL) to effectively estimate a video's origin and influence. Without losing generality, we conduct experiments on an annotated dataset collected from a major video sharing site to evaluate the effectiveness of the framework. Sur...
Li, Z., Liu, J., Yang, Y., Zhou, X. & Lu, H. 2014, 'Clustering-Guided Sparse Structural Learning for Unsupervised Feature Selection', IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 9, pp. 2138-2150.
View/Download from: Publisher's site
© 1989-2012 IEEE. Many pattern analysis and data mining problems have witnessed high-dimensional data represented by a large number of features, which are often redundant and noisy. Feature selection is one main technique for dimensionality reduction that involves identifying a subset of the most useful features. In this paper, a novel unsupervised feature selection algorithm, named clustering-guided sparse structural learning (CGSSL), is proposed by integrating cluster analysis and sparse structural analysis into a joint framework and experimentally evaluated. Nonnegative spectral clustering is developed to learn more accurate cluster labels of the input samples, which guide feature selection simultaneously. Meanwhile, the cluster labels are also predicted by exploiting the hidden structure shared by different features, which can uncover feature correlations to make the results more reliable. Row-wise sparse models are leveraged to make the proposed model suitable for feature selection. To optimize the proposed formulation, we propose an efficient iterative algorithm. Finally, extensive experiments are conducted on 12 diverse benchmarks, including face data, handwritten digit data, document data, and biomedical data. The encouraging experimental results in comparison with several representative algorithms and the theoretical analysis demonstrate the efficiency and effectiveness of the proposed algorithm for feature selection.
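The row-wise sparse model mentioned above admits a simple illustration: once a feature-selection matrix W has been learned under an ℓ2,1 penalty, features are ranked by the ℓ2 norms of their rows, and rows driven to (near) zero are discarded. The W below is synthetic; CGSSL obtains it from its joint clustering and structural-learning objective, which this sketch does not implement.

    import numpy as np

    def l21_norm(W):
        # ||W||_{2,1} = sum of the l2 norms of the rows of W.
        return np.linalg.norm(W, axis=1).sum()

    def select_features(W, num_features):
        # Keep the features whose rows of W have the largest l2 norms.
        scores = np.linalg.norm(W, axis=1)
        return np.argsort(scores)[::-1][:num_features]

    rng = np.random.default_rng(0)
    W = rng.standard_normal((100, 5))                 # synthetic projection matrix
    W[rng.choice(100, 80, replace=False)] *= 0.01     # most rows near zero (row-sparse)
    print(l21_norm(W), select_features(W, 10))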
Wang, S., Ma, Z., Yang, Y., Li, X., Pang, C. & Hauptmann, A.G. 2014, 'Semi-supervised multiple feature analysis for action recognition', IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 289-298.
View/Download from: Publisher's site
This paper presents a semi-supervised method for categorizing human actions using multiple visual features. The proposed algorithm simultaneously learns multiple features from a small number of labeled videos, and automatically utilizes data distributions between labeled and unlabeled data to boost the recognition performance. Shared structural analysis is applied in our approach to discover a common subspace shared by each type of feature. In the subspace, the proposed algorithm is able to characterize more discriminative information of each feature type. Additionally, data distribution information of each type of feature has been preserved. The aforementioned attributes make our algorithm robust for action recognition, especially when only limited labeled training samples are provided. Extensive experiments have been conducted on both choreographed and realistic video datasets, including KTH, YouTube Action and UCF50. Experimental results show that our method outperforms several state-of-the-art algorithms. Most notably, much better performance has been achieved when there are only a few labeled training samples. © 1999-2012 IEEE.
Zhang, L., Song, M., Yang, Y., Zhao, Q., Zhao, C. & Sebe, N. 2014, 'Weakly supervised photo cropping', IEEE Transactions on Multimedia, vol. 16, no. 1, pp. 94-107.
View/Download from: Publisher's site
Photo cropping is widely used in the printing industry, photography, and cinematography. Conventional photo cropping methods suffer from three drawbacks: 1) the semantics used to describe photo aesthetics are determined by the experience of model designers and specific data sets, 2) image global configurations, an essential cue to capture photo aesthetics, are not well preserved in the cropped photo, and 3) multi-channel visual features from an image region contribute differently to human aesthetics, but state-of-the-art photo cropping methods cannot automatically weight them. Owing to the recent progress in the image retrieval community, image-level semantics, i.e., photo labels obtained without much human supervision, can be efficiently and effectively acquired. Thus, we propose weakly supervised photo cropping, where a manifold embedding algorithm is developed to incorporate image-level semantics and image global configurations with graphlets, i.e., small-sized connected subgraphs. After manifold embedding, a Bayesian Network (BN) is proposed. It incorporates the testing photo into the framework derived from the multi-channel post-embedding graphlets of the training data, the importance of which is determined automatically. Based on the BN, photo cropping can be cast as searching for the candidate cropped photo that maximally preserves graphlets from the training photos, and the optimal cropping parameter is inferred by Gibbs sampling. Subjective evaluations demonstrate that: 1) our approach outperforms several representative photo cropping methods, including our previous cropping model that is guided by semantics-free graphlets, and 2) the visualized graphlets explicitly capture photo semantics and global spatial configurations. © 1999-2012 IEEE.
Ji, R., Yang, Y., Sebe, N., Aizawa, K. & Cao, L. 2014, 'Large-scale geosocial multimedia', IEEE Multimedia, vol. 21, no. 3, pp. 7-9.
View/Download from: Publisher's site
With the advance of the Web 2.0 era came an explosive growth of geographical multimedia data shared on social network websites such as Flickr, YouTube, Facebook, and Zooomr. Location-aware media description, modeling, learning, and recommendation in pervasive social media analytics have become a key focus of the recent research in computer vision, multimedia, and signal processing societies. A new breed of multimedia applications that incorporates image/video annotation, visual search, content mining and recommendation, and so on may revolutionize the field. Combined with the popularity of location-aware social multimedia, location context data makes traditionally challenging problems more tractable. This special issue brings together active researchers to share recent progress in this exciting area. This issue highlights the latest developments in large-scale multiple evidence-based learning for geosocial multimedia computing and identifies several key challenges and potential innovations. © 2014 IEEE.
Li, P., Bu, J., Yang, Y., Ji, R., Chen, C. & Cai, D. 2014, 'Discriminative Orthogonal Nonnegative matrix factorization with flexibility for data representation', Expert Systems with Applications, vol. 41, no. 4 PART 1, pp. 1283-1293.
View/Download from: Publisher's site
Learning an informative data representation is of vital importance in multidisciplinary applications, e.g., face analysis, document clustering and collaborative filtering. As a very useful tool, Nonnegative matrix factorization (NMF) is often employed to learn a well-structured data representation. While the geometrical structure of the data has been studied in some previous NMF variants, the existing works typically neglect the discriminant information revealed by the between-class scatter and the total scatter of the data. To address this issue, we present a novel approach named Discriminative Orthogonal Nonnegative matrix factorization (DON), which preserves both the local manifold structure and the global discriminant information simultaneously through manifold discriminant learning. In particular, to learn the discriminant structure for the data representation, we introduce the scaled indicator matrix, which naturally satisfies the orthogonality condition. Thus, we impose orthogonality constraints on the objective function. However, overly strict constraints lead to a very sparse data representation, which is undesirable in practice, so we further make the orthogonality constraint flexible. In addition, we provide the optimization framework with the convergence proof of the updating rules. Extensive comparisons over several state-of-the-art approaches demonstrate the efficacy of the proposed method. © 2013 Elsevier Ltd. All rights reserved.
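For orientation, the sketch below shows plain NMF with the classical multiplicative updates, i.e., the baseline that DON builds on; DON's discriminative term and flexible orthogonality constraint are omitted here, so this is an illustration of the factorization itself rather than the paper's method.

    import numpy as np

    def nmf(X, k, iters=200, eps=1e-9):
        # Factor X ~ U @ V with all entries nonnegative.
        rng = np.random.default_rng(0)
        n, m = X.shape
        U = rng.random((n, k))
        V = rng.random((k, m))
        for _ in range(iters):
            # Multiplicative updates keep U, V nonnegative and
            # decrease the Frobenius reconstruction error.
            V *= (U.T @ X) / (U.T @ U @ V + eps)
            U *= (X @ V.T) / (U @ V @ V.T + eps)
        return U, V

    X = np.abs(np.random.default_rng(1).standard_normal((50, 40)))
    U, V = nmf(X, k=5)
    print(np.linalg.norm(X - U @ V))   # reconstruction error after fitting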
Liu, J., Zhang, P., Yu, T., Yang, Y. & Qiu, H. 2014, '[Effects of losartan on pulmonary dendritic cells in lipopolysaccharide-induced acute lung injury mice].', Zhonghua yi xue za zhi, vol. 94, no. 41, pp. 3216-3219.
OBJECTIVE: To assess the effects of losartan on the frequency and phenotype of respiratory dendritic cells (DC) in lipopolysaccharide (LPS)-induced acute lung injury (ALI) mice. METHODS: C57BL/6 mice were randomly divided into 3 groups: control, ALI and ALI+losartan. ALI animals received 2 mg/kg of LPS; ALI+losartan animals received 2 mg/kg of LPS plus 15 mg/kg of losartan 30 min before the intratracheal injection of LPS; control animals received phosphate-buffered saline (PBS) instead of LPS. Lung wet weight/body weight (LW/BW) was recorded to assess lung injury. The pathological changes were examined under an optical microscope. The frequency and phenotype of pulmonary DC were characterized by flow cytometry. Meanwhile, the levels of IL-6 in lung homogenates were assessed by enzyme-linked immunosorbent assay (ELISA). RESULTS: (1) The LPS-induced rise in LW/BW was partially prevented by pretreatment with losartan. (2) Histologically, widespread alveolar wall thickening caused by edema, severe hemorrhage in the interstitium and alveolus, and marked, diffuse interstitial infiltration of inflammatory cells were observed in the ALI group, whereas losartan effectively attenuated the LPS-induced pulmonary hemorrhage and leukocytic infiltration in the interstitium and alveolus. (3) Meanwhile, the levels of IL-6 in lung tissue were significantly enhanced in the LPS-induced ALI mice, yet after pretreatment with losartan the pulmonary level of IL-6 markedly decreased. (4) LPS dosing resulted in a rapid accumulation of DC in lung tissues and an up-regulated expression of CD80 in LPS-induced ALI. In contrast, the expression of MHC II on respiratory DC was not significantly different among groups. Pretreatment with losartan led to a marked reduction in CD80 expression on pulmonary DC (P < 0.05 vs ALI). CONCLUSION: Losartan may attenuate pulmonary injury by inhibiting the activation of pulmonary DC.
Mu, Y., Yang, Y., Cao, L., Yan, S. & Tian, Q. 2014, 'Guest Editorial: Special issue on large scale multimedia semantic indexing', Computer Vision and Image Understanding, vol. 124, pp. 1-2.
View/Download from: Publisher's site
Yang, Y., Sebe, N., Snoek, C., Hua, X.-S. & Zhuang, Y. 2014, 'Special section on learning from multiple evidences for large scale multimedia analysis', Computer Vision and Image Understanding, vol. 118, pp. 1-1.
View/Download from: Publisher's site
Yang, Y., Yang, Y., Shen, H.T., Zhang, Y., Du, X. & Zhou, X. 2013, 'Discriminative nonnegative spectral clustering with out-of-sample extension', IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 8, pp. 1760-1771.
View/Download from: Publisher's site
Data clustering is one of the fundamental research problems in data mining and machine learning. Most of the existing clustering methods, for example, normalized cut and k-means, suffer from the fact that their optimization processes normally lead to an NP-hard problem due to the discretization of the elements in the cluster indicator matrix. A practical way to cope with this problem is to relax this constraint to allow the elements to be continuous values. The eigenvalue decomposition can be applied to generate a continuous solution, which has to be further discretized. However, the continuous solution is probably mixed-signed. This may cause it to deviate severely from the true solution, which should naturally be nonnegative. In this paper, we propose a novel clustering algorithm, i.e., discriminative nonnegative spectral clustering, which explicitly imposes an additional nonnegative constraint on the cluster indicator matrix to seek a more interpretable solution. Moreover, we show an effective regularization term which is able not only to provide more useful discriminative information but also to learn a mapping function to predict cluster labels for the out-of-sample test data. Extensive experiments on various data sets illustrate the superiority of our proposal compared to state-of-the-art clustering algorithms. © 2012 IEEE.
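A small sketch of the relaxation issue discussed above, under simplifying assumptions: the eigenvectors of a normalized Laplacian give a continuous, generally mixed-signed cluster indicator that still needs a discretization step, which is what the nonnegative constraint is designed to make more interpretable. The Gaussian affinity and the sign-based readout below are generic illustrations, not the paper's algorithm.

    import numpy as np

    def spectral_relaxation(X, k, sigma=1.0):
        # Gaussian affinity and symmetric normalized Laplacian.
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        A = np.exp(-sq / (2 * sigma ** 2))
        d = A.sum(1)
        L = np.eye(len(X)) - A / np.sqrt(np.outer(d, d))
        # Relaxed indicators: eigenvectors of the k smallest eigenvalues.
        vals, vecs = np.linalg.eigh(L)
        return vecs[:, :k]                 # generally mixed-signed

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, .3, (30, 2)), rng.normal(3, .3, (30, 2))])
    F = spectral_relaxation(X, k=2)
    labels = (F[:, 1] > 0).astype(int)     # discretizing the mixed-signed solution
    print(labels)                          # separates the two blobs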
Yang, Y., Song, J., Huang, Z., Ma, Z., Sebe, N. & Hauptman, A.G. 2013, 'Multi-feature fusion via hierarchical regression for multimedia analysis', IEEE Transactions on Multimedia, vol. 15, no. 3, pp. 572-581.
View/Download from: Publisher's site
Multimedia data are usually represented by multiple features. In this paper, we propose a new algorithm, namely Multi-feature Learning via Hierarchical Regression for multimedia semantics understanding, where two issues are considered. First, labeling a large amount of training data is labor-intensive. It is meaningful to effectively leverage unlabeled data to facilitate multimedia semantics understanding. Second, given that multimedia data can be represented by multiple features, it is advantageous to develop an algorithm which combines evidence obtained from different features to infer reliable multimedia semantic concept classifiers. We design a hierarchical regression model to exploit the information derived from each type of feature, which is then collaboratively fused to obtain a multimedia semantic concept classifier. Both label information and data distribution of different features representing multimedia data are considered. The algorithm can be applied to a wide range of multimedia applications, and experiments are conducted on video data for video concept annotation and action recognition. Using the Trecvid and CareMedia video datasets, the experimental results show that it is beneficial to combine multiple features. The performance of the proposed algorithm is remarkable when only a small amount of labeled training data is available. © 2012 IEEE.
Yang, Y., Ma, Z., Hauptmann, A.G. & Sebe, N. 2013, 'Feature selection for multimedia analysis by sharing information among multiple tasks', IEEE Transactions on Multimedia, vol. 15, no. 3, pp. 661-669.
View/Download from: Publisher's site
While much progress has been made in multi-task classification and subspace learning, multi-task feature selection has long been largely unaddressed. In this paper, we propose a new multi-task feature selection algorithm and apply it to multimedia (e.g., video and image) analysis. Instead of evaluating the importance of each feature individually, our algorithm selects features in a batch mode, by which the feature correlation is considered. While feature selection has received much research attention, less effort has been made on improving the performance of feature selection by leveraging the shared knowledge from multiple related tasks. Our algorithm builds upon the assumption that different related tasks have common structures. Multiple feature selection functions of different tasks are simultaneously learned in a joint framework, which enables our algorithm to utilize the common knowledge of multiple tasks as supplementary information to facilitate decision making. An efficient iterative algorithm is proposed to optimize it, whose convergence is guaranteed. Experiments on different databases have demonstrated the effectiveness of the proposed algorithm. © 2012 IEEE.
Cao, X., Wei, X., Han, Y., Yang, Y., Sebe, N. & Hauptmann, A. 2013, 'Unified Dictionary Learning and Region Tagging with Hierarchical Sparse Representation', Computer Vision and Image Understanding, vol. 117, no. 8, pp. 934-946.
View/Download from: Publisher's site
Image patterns at different spatial levels are well organized, such as regions within one image and feature points within one region. These classes of spatial structures are hierarchical in nature. The appropriate integration and utilization of such relationships are important to improve the performance of region tagging. Inspired by the recent advances of sparse coding methods, we propose an approach, called Unified Dictionary Learning and Region Tagging with Hierarchical Sparse Representation. This approach consists of two steps: region representation and region reconstruction. In the first step, rather than using the ℓ1-norm as is commonly done in sparse coding, we add a hierarchical structure to the process of sparse coding and form a framework of tree-guided dictionary learning. In this framework, the hierarchical structures among feature points, regions, and images are encoded by forming a tree-guided multi-task learning process. With the learned dictionary, we obtain a better representation of training and testing regions. In the second step, we propose to use a sub-hierarchical structure to guide the sparse reconstruction for testing regions, i.e., the structure between regions and images. Thanks to this hierarchy, the obtained reconstruction coefficients are more discriminative. Finally, tags are propagated to testing regions via the learned reconstruction coefficients. Extensive experiments on three public benchmark image data sets demonstrate that the proposed approach performs better at region tagging than current state-of-the-art methods. © 2013 Elsevier Inc. All rights reserved.
Yang, Y., Yang, Y. & Shen, H.T. 2013, 'Effective transfer tagging from image to video', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 9, no. 2.
View/Download from: Publisher's site
Recent years have witnessed a great explosion of user-generated videos on the Web. In order to achieve effective and efficient video search, it is critical for modern video search engines to associate videos with semantic keywords automatically. Most of the existing video tagging methods can hardly achieve reliable performance due to a deficiency of training data. It is noticed that abundant well-tagged data are available in other relevant types of media (e.g., images). In this article, we propose a novel video tagging framework, termed Cross-Media Tag Transfer (CMTT), which utilizes the abundance of well-tagged images to facilitate video tagging. Specifically, we build a cross-media tunnel to transfer knowledge from images to videos. To this end, an optimal kernel space, in which the distribution distance between images and videos is minimized, is found to tackle the domain-shift problem. A novel cross-media video tagging model is proposed to infer tags by exploring the intrinsic local structures of both labeled and unlabeled data, and to learn reliable video classifiers. An efficient algorithm is designed to optimize the proposed model in an iterative, alternating way. Extensive experiments illustrate the superiority of our proposal compared to the state-of-the-art algorithms. © 2013 ACM.
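One standard way to quantify the image-to-video distribution mismatch mentioned above is the empirical maximum mean discrepancy (MMD) under an RBF kernel, sketched below. The exact distance and kernel-space optimization used by CMTT may differ, so treat this estimator, and the synthetic features, as illustrative assumptions.

    import numpy as np

    def rbf(A, B, sigma=1.0):
        # RBF kernel matrix between two sample sets.
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))

    def mmd2(X, Y, sigma=1.0):
        # Squared MMD (biased estimate): E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
        return rbf(X, X, sigma).mean() + rbf(Y, Y, sigma).mean() \
               - 2 * rbf(X, Y, sigma).mean()

    rng = np.random.default_rng(0)
    img_feats = rng.normal(0.0, 1.0, (100, 16))   # source (image) features, synthetic
    vid_feats = rng.normal(0.5, 1.0, (100, 16))   # target (video) features, synthetic
    print(mmd2(img_feats, vid_feats))             # larger when the domains differ more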
Liang, Z., Zhuang, Y., Yang, Y. & Xiao, J. 2013, 'Retrieval-based cartoon gesture recognition and applications via semi-supervised heterogeneous classifiers learning', Pattern Recognition, vol. 46, no. 1, pp. 412-423.
View/Download from: Publisher's site
2D cartoons play an important role in many areas, but effective methods are required to reduce the manual labor involved. In this paper, we propose a heterogeneous cartoon gesture recognition method with applications. Firstly, heterogeneous features with different dimensions are assigned to express cartoon and human-subject images according to their characteristics. Then, for recognition, we simultaneously integrate shared structure learning (SSL) and graph-based transductive learning into a joint framework to learn reliable classifiers on heterogeneous features. Provided with the framework, the similarities between cartoon and human-subject gestures can be quantitatively evaluated in a cross-feature manner. Extensive experiments on self-defined datasets have demonstrated the effectiveness of our method. Finally, applications illustrate the usages in various aspects of the 2D cartoon industry. © 2012 Elsevier Ltd. All rights reserved.
Gao, C., Meng, D., Yang, Y., Wang, Y., Zhou, X. & Hauptmann, A.G. 2013, 'Infrared patch-image model for small target detection in a single image', IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 4996-5009.
View/Download from: Publisher's site
The robust detection of small targets is one of the key techniques in infrared search and tracking applications. A novel small target detection method in a single infrared image is proposed in this paper. Initially, the traditional infrared image model is generalized to a new infrared patch-image model using local patch construction. Then, because of the non-local self-correlation property of the infrared background image, small target detection based on the new model is formulated as an optimization problem of recovering low-rank and sparse matrices, which is effectively solved using stable principal component pursuit. Finally, a simple adaptive segmentation method is used to segment the target image, and the segmentation result can be refined by post-processing. Extensive synthetic and real data experiments show that under different clutter backgrounds the proposed method not only works more stably for different target sizes and signal-to-clutter ratio values, but also has better detection performance compared with conventional baseline methods. © 1992-2012 IEEE.
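The low-rank plus sparse recovery at the heart of the patch-image model can be conveyed with a toy alternating scheme: singular-value thresholding estimates the low-rank background while soft-thresholding collects the sparse targets. The paper solves this with a proper stable principal component pursuit solver; the loop below is only a didactic approximation on synthetic data.

    import numpy as np

    def svt(M, tau):
        # Singular value thresholding: shrink singular values by tau.
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

    def soft(M, lam):
        # Entrywise soft-thresholding (sparse shrinkage).
        return np.sign(M) * np.maximum(np.abs(M) - lam, 0)

    def separate(D, tau=1.0, lam=0.1, iters=50):
        B = np.zeros_like(D)
        T = np.zeros_like(D)
        for _ in range(iters):
            B = svt(D - T, tau)    # low-rank background estimate
            T = soft(D - B, lam)   # sparse small-target estimate
        return B, T

    rng = np.random.default_rng(0)
    D = np.outer(rng.random(40), rng.random(40))   # rank-1 "background"
    D[10, 25] += 3.0                               # a small bright "target"
    B, T = separate(D)
    print(np.unravel_index(np.argmax(np.abs(T)), T.shape))   # should recover (10, 25)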
Zhang, L., Han, Y., Yang, Y., Song, M., Yan, S. & Tian, Q. 2013, 'Discovering discriminative graphlets for aerial image categories recognition', IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 5071-5084.
View/Download from: Publisher's site
Recognizing aerial image categories is useful for scene annotation and surveillance. Local features have been demonstrated to be robust to image transformations, including occlusions and clutters. However, the geometric property of an aerial image (i.e., the topology and relative displacement of local features), which is key to discriminating aerial image categories, cannot be effectively represented by state-of-the-art generic visual descriptors. To solve this problem, we propose a recognition model that mines graphlets from aerial images, where graphlets are small connected subgraphs reflecting both the geometric property and color/texture distribution of an aerial image. More specifically, each aerial image is decomposed into a set of basic components (e.g., road and playground) and a region adjacency graph (RAG) is accordingly constructed to model their spatial interactions. Aerial image category recognition can subsequently be cast as RAG-to-RAG matching. Based on graph theory, RAG-to-RAG matching is conducted by comparing all their respective graphlets. Because the number of graphlets is huge, we derive a manifold embedding algorithm to measure different-sized graphlets, after which we select graphlets that have highly discriminative and low-redundancy topologies. Through quantizing the selected graphlets from each aerial image into a feature vector, we use a support vector machine to discriminate aerial image categories. Experimental results indicate that our method outperforms several state-of-the-art object/scene recognition models, and the visualized graphlets indicate that the discriminative patterns are discovered by our proposed approach. © 1992-2012 IEEE.
Song, J., Yang, Y., Huang, Z., Shen, H.T. & Luo, J. 2013, 'Effective multiple feature hashing for large-scale near-duplicate video retrieval', IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1997-2008.
View/Download from: Publisher's site
Near-duplicate video retrieval (NDVR) has recently attracted much research attention due to the exponential growth of online videos. It has many applications, such as copyright protection, automatic video tagging and online video monitoring. Many existing approaches use only a single feature to represent a video for NDVR. However, a single feature is often insufficient to characterize the video content. Moreover, while accuracy is the main concern in the previous literature, the scalability of NDVR algorithms for large-scale video datasets has rarely been addressed. In this paper, we present a novel approach, Multiple Feature Hashing (MFH), to tackle both the accuracy and the scalability issues of NDVR. MFH preserves the local structural information of each individual feature and also globally considers the local structures for all the features to learn a group of hash functions to map the video keyframes into the Hamming space and generate a series of binary codes to represent the video dataset. We evaluate our approach on a public video dataset and a large-scale video dataset consisting of 132,647 videos that we collected from YouTube. This dataset has been released (http://itee.uq.edu.au/shenht/UQ-VIDEO/). The experimental results show that the proposed method outperforms the state-of-the-art techniques in both accuracy and efficiency. © 2013 IEEE.
Ma, Z., Yang, Y., Sebe, N., Zheng, K. & Hauptmann, A.G. 2013, 'Multimedia event detection using a classifier-specific intermediate representation', IEEE Transactions on Multimedia, vol. 15, no. 7, pp. 1628-1637.
View/Download from: Publisher's site
Multimedia event detection (MED) plays an important role in many applications such as video indexing and retrieval. Current event detection works mainly focus on sports and news event detection or abnormality detection in surveillance videos. In contrast, our research aims to detect more complicated and generic events within a longer video sequence. In the past, researchers have proposed using intermediate concept classifiers with concept lexica to help understand the videos. Yet it is difficult to judge how many and what concepts would be sufficient for the particular video analysis task. Additionally, obtaining robust semantic concept classifiers requires a large number of positive training examples, which in turn has a high human annotation cost. In this paper, we propose an approach that exploits external concepts-based videos and event-based videos simultaneously to learn an intermediate representation from video features. Our algorithm integrates the classifier inference and the latent intermediate representation into a joint framework. The joint optimization of the intermediate representation and the classifier makes them mutually beneficial and reciprocal. Effectively, the intermediate representation and the classifier are tightly correlated. The classifier-dependent intermediate representation not only accurately reflects the task semantics but is also more suitable for the specific classifier. Thus we have created a discriminative semantic analysis framework based on a tightly coupled intermediate representation. Extensive experiments on multimedia event detection using real-world videos demonstrate the effectiveness of the proposed approach. © 2013 IEEE.
Yang, Y., Huang, Z., Yang, Y., Liu, J., Shen, H.T. & Luo, J. 2013, 'Local image tagging via graph regularized joint group sparsity', Pattern Recognition, vol. 46, no. 5, pp. 1358-1368.
View/Download from: Publisher's site
Feng, Y., Xiao, J., Zha, Z., Zhang, H. & Yang, Y. 2012, 'Active learning for social image retrieval using Locally Regressive Optimal Design', Neurocomputing, vol. 95, pp. 54-59.
View/Download from: UTS OPUS or Publisher's site
In this paper, we propose a novel active learning algorithm, called Locally Regressive Optimal Design (LROD), to improve the effectiveness of relevance feedback-based social image retrieval. Our algorithm assumes that for each data point, the label values of both this data point and its neighbors can be well estimated using a locally regressive function. Specifically, we adopt a local linear regression model to predict the label value of each data point in a local patch. The regularized local model prediction error of the local patch is defined as our local loss function. Then, a unified objective function is proposed to minimize the summation of these local loss functions over all the data points, so that an optimal predicted label value can be assigned to each data point. Finally, we embed it into a semi-supervised learning framework to construct the final objective function. Experimental results on the MSRA-MM2.0 database demonstrate the efficiency and effectiveness of the proposed algorithm for relevance feedback-based social image retrieval. © 2012 Elsevier B.V.
Liu, Y., Wu, F., Yang, Y., Zhuang, Y. & Hauptmann, A.G. 2012, 'Spline regression hashing for fast image search', IEEE Transactions on Image Processing, vol. 21, no. 10, pp. 4480-4491.
View/Download from: UTS OPUS
Techniques for fast image retrieval over large databases have attracted considerable attention due to the rapid growth of web images. One promising way to accelerate image search is to use hashing technologies, which represent images by compact binary codewords. In this way, the similarity between images can be efficiently measured in terms of the Hamming distance between their corresponding binary codes. Although plenty of methods for generating hash codes have been proposed in recent years, there are still two key points that need to be improved: 1) how to precisely preserve the similarity structure of the original data and 2) how to obtain the hash codes of previously unseen data. In this paper, we propose our spline regression hashing method, in which both the local and global data similarity structures are exploited. To better capture the local manifold structure, we introduce splines developed in Sobolev space to find the local data mapping function. Furthermore, our framework simultaneously learns the hash codes of the training data and the hash function for the unseen data, which solves the out-of-sample problem. Extensive experiments conducted on real image datasets consisting of over one million images show that our proposed method outperforms the state-of-the-art techniques.
Yang, Y., Wu, F., Nie, F., Shen, H.T., Zhuang, Y. & Hauptmann, A.G. 2012, 'Web and personal image annotation by mining label correlation with relaxed visual graph embedding', IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 1339-1351.
View/Download from: UTS OPUS
The number of digital images is rapidly increasing, and it has become an important challenge to organize these resources effectively. As a way to facilitate image categorization and retrieval, automatic image annotation has received much research attention. Considering that there are a great number of unlabeled images available, it is beneficial to develop an effective mechanism to leverage unlabeled images for large-scale image annotation. Meanwhile, a single image is usually associated with multiple labels, which are inherently correlated with each other. A straightforward method of image annotation is to decompose the problem into multiple independent single-label problems, but this ignores the underlying correlations among different labels. In this paper, we propose a new inductive algorithm for image annotation by integrating label correlation mining and visual similarity mining into a joint framework. We first construct a graph model according to image visual features. A multilabel classifier is then trained by simultaneously uncovering the shared structure common to different labels and the visual graph embedded label prediction matrix for image annotation. We show that the globally optimal solution of the proposed framework can be obtained by performing generalized eigen-decomposition. We apply the proposed framework to both web image annotation and personal album labeling using the NUS-WIDE, MSRA MM 2.0, and Kodak image data sets, with AUC as the evaluation metric. Extensive experiments on large-scale image databases collected from the web and personal albums show that the proposed algorithm is capable of utilizing both labeled and unlabeled data for image annotation and outperforms other algorithms.
Yang, Y., Nie, F., Xu, D., Luo, J., Zhuang, Y. & Pan, Y. 2012, 'A multimedia retrieval framework based on semi-supervised ranking and relevance feedback', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 723-742.
View/Download from: UTS OPUS
We present a new framework for multimedia content analysis and retrieval which consists of two independent algorithms. First, we propose a new semi-supervised algorithm called ranking with Local Regression and Global Alignment (LRGA) to learn a robust Laplacian matrix for data ranking. In LRGA, for each data point, a local linear regression model is used to predict the ranking scores of its neighboring points. A unified objective function is then proposed to globally align the local models from all the data points so that an optimal ranking score can be assigned to each data point. Second, we propose a semi-supervised long-term Relevance Feedback (RF) algorithm to refine the multimedia data representation. The proposed long-term RF algorithm utilizes both the multimedia data distribution in multimedia feature space and the history RF information provided by users. A trace ratio optimization problem is then formulated and solved by an efficient algorithm. The algorithms have been applied to several content-based multimedia retrieval applications, including cross-media retrieval, image retrieval, and 3D motion/pose data retrieval. Comprehensive experiments on four data sets have demonstrated its advantages in precision, robustness, scalability, and computational efficiency.
Ma, Z., Nie, F., Yang, Y., Uijlings, J.R.R., Sebe, N. & Hauptmann, A.G. 2012, 'Discriminating joint feature analysis for multimedia data understanding', IEEE Transactions on Multimedia, vol. 14, no. 6, pp. 1662-1672.
View/Download from: UTS OPUS or Publisher's site
In this paper, we propose a novel semi-supervised feature analyzing framework for multimedia data understanding and apply it to three different applications: image annotation, video concept detection and 3-D motion data analysis. Our method is built upon two advancements of the state of the art: 1) ℓ2,1-norm regularized feature selection, which can jointly select the most relevant features from all the data points. This feature selection approach has been shown to be robust and efficient in the literature, as it considers the correlation between different features jointly when conducting feature selection; 2) manifold learning, which analyzes the feature space by exploiting both labeled and unlabeled data. It is a widely used technique to extend many algorithms to semi-supervised scenarios for its capability of leveraging the manifold structure of multimedia data. The proposed method is able to learn a classifier for different applications by selecting the discriminating features closely related to the semantic concepts. The objective function of our method is non-smooth and difficult to solve, so we design an efficient iterative algorithm with fast convergence, thus making it applicable to practical applications. Extensive experiments on image annotation, video concept detection and 3-D motion data analysis are performed on different real-world data sets to demonstrate the effectiveness of our algorithm. © 1999-2012 IEEE.
Zha, Z.J., Wang, M., Zheng, Y.T., Yang, Y., Hong, R. & Chua, T.S. 2012, 'Interactive video indexing with statistical active learning', IEEE Transactions on Multimedia, vol. 14, no. 1, pp. 17-27.
View/Download from: UTS OPUS or Publisher's site
Video indexing, also called video concept detection, has attracted increasing attention from both academia and industry. To reduce human labeling cost, active learning has recently been introduced to video indexing. In this paper, we propose a novel active learning approach based on the optimum experimental design criteria in statistics. Different from existing optimum experimental design, our approach simultaneously exploits each sample's local structure, and sample relevance, density, and diversity information, as well as makes use of labeled and unlabeled data. Specifically, we develop a local learning model to exploit the local structure of each sample. Our assumption is that for each sample, its label can be well estimated based on its neighbors. By globally aligning the local models from all the samples, we obtain a local learning regularizer, based on which a local learning regularized least square model is proposed. Finally, a unified sample selection approach is developed for interactive video indexing, which takes into account the sample relevance, density and diversity information, and sample efficacy in minimizing the parameter variance of the proposed local learning regularized least square model. We compare the performance between our approach and the state-of-the-art approaches on the TREC video retrieval evaluation (TRECVID) benchmark, and we report superior performance from the proposed approach. © 2006 IEEE.
Ma, Z., Nie, F., Yang, Y., Uijlings, J.R.R. & Sebe, N. 2012, 'Web Image Annotation Via Subspace-Sparsity Collaborated Feature Selection', IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 1021-1030.
View/Download from: UTS OPUS or Publisher's site
Chen, C., Zhuang, Y., Nie, F., Yang, Y., Wu, F. & Xiao, J. 2011, 'Learning a 3D Human Pose Distance Metric from Geometric Pose Descriptor', IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 11, pp. 1676-1689.
View/Download from: UTS OPUS or Publisher's site
Chen, C., Yang, Y., Nie, F. & Odobez, J.M. 2011, '3D human pose recovery from image by efficient visual feature selection', Computer Vision and Image Understanding, vol. 115, no. 3, pp. 290-299.
View/Download from: UTS OPUS or Publisher's site
In this paper we propose a new exemplar-based approach to recover 3D human poses from monocular images. Given the visual feature of each frame, pose retrieval is first conducted in the exemplar database to find relevant pose candidates. Then, dynamic programming is applied on the pose candidates to recover a continuous pose sequence. We make two contributions within this framework. First, we propose to use an efficient feature selection algorithm to select effective visual feature components. The task is formulated as a trace-ratio criterion which measures the score of the selected feature component subset, and the criterion is efficiently optimized to achieve the global optimum. The selected components are used instead of the original full feature set to improve the accuracy and efficiency of pose recovery. As a second contribution, we propose to use sparse representation to retrieve the pose candidates, where the measured visual feature is expressed as a sparse linear combination of the exemplars in the database. Sparse representation ensures that semantically similar poses have a larger probability of being retrieved. The effectiveness of our approach is validated quantitatively through extensive evaluations on both synthetic and real data, and qualitatively by inspecting the results of the real-time system we have implemented. © 2010 Elsevier Inc. All rights reserved.
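The trace-ratio criterion named above has a particularly simple form when restricted to per-feature (diagonal) scatter scores, sketched below: iterate between choosing the d features with the largest b_i - λ·w_i and updating λ to the ratio of the selected sums. The scores b and w here are synthetic stand-ins for between- and within-class scatter; the paper optimizes the criterion over visual feature components, which this toy version only mirrors in its simplest form.

    import numpy as np

    def trace_ratio_select(b, w, d, iters=20):
        # Iteratively maximize sum(b[S]) / sum(w[S]) over subsets S of size d.
        lam = b.sum() / w.sum()
        for _ in range(iters):
            idx = np.argsort(b - lam * w)[::-1][:d]   # best subset for current lam
            lam = b[idx].sum() / w[idx].sum()         # update the ratio
        return idx, lam

    rng = np.random.default_rng(0)
    b = rng.random(50)            # between-class scatter per feature (synthetic)
    w = rng.random(50) + 0.1      # within-class scatter per feature (synthetic)
    print(trace_ratio_select(b, w, d=10))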
Yang, Y., Zhuang, Y., Tao, D., Xu, D., Yu, J. & Luo, J. 2010, 'Recognizing Cartoon Image Gestures for Retrieval and Interactive Cartoon Clip Synthesis', IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 12, pp. 1745-1756.
View/Download from: UTS OPUS or Publisher's site
In this paper, we propose a new method to recognize gestures of cartoon images with two practical applications, i.e., content-based cartoon image retrieval and interactive cartoon clip synthesis. Upon analyzing the unique properties of four types of features including global color histogram, local color histogram (LCH), edge feature (EF), and motion direction feature (MDF), we propose to employ different features for different purposes and in various phases. We use EF to define a graph and then refine its local structure by LCH. Based on this graph, we adopt a transductive learning algorithm to construct local patches for each cartoon image. A spectral method is then proposed to optimize the local structure of each patch and then align these patches globally. MDF is fused with EF and LCH and a cartoon gesture space is constructed for cartoon image gesture recognition. We apply the proposed method to content-based cartoon image retrieval and interactive cartoon clip synthesis. The experiments demonstrate the effectiveness of our method.
Yang, Y., Xu, D., Nie, F., Yan, S. & Zhuang, Y. 2010, 'Image clustering using local discriminant models and global integration', IEEE Transactions on Image Processing, vol. 19, no. 10, pp. 2761-2773.
In this paper, we propose a new image clustering algorithm, referred to as clustering using local discriminant models and global integration (LDMGI). To deal with the data points sampled from a nonlinear manifold, for each data point we construct a local clique comprising this data point and its neighboring data points. Inspired by the Fisher criterion, we use a local discriminant model for each local clique to evaluate the clustering performance of samples within the local clique. To obtain the clustering result, we further propose a unified objective function to globally integrate the local models of all the local cliques. With the unified objective function, spectral relaxation and spectral rotation are used to obtain the binary cluster indicator matrix for all the samples. We show that LDMGI shares a similar objective function with the spectral clustering (SC) algorithms, e.g., normalized cut (NCut). In contrast to NCut, in which the Laplacian matrix is directly calculated based upon a Gaussian function, a new Laplacian matrix is learnt in LDMGI by exploiting both manifold structure and local discriminant information. We also prove that K-means and discriminative K-means (DisKmeans) are both special cases of LDMGI. Extensive experiments on several benchmark image datasets demonstrate the effectiveness of LDMGI. We observe in the experiments that LDMGI is more robust to its algorithmic parameters than NCut. Thus, LDMGI is more appealing for real image clustering applications in which the ground truth is generally not available for tuning algorithmic parameters.
Wu, F., Wang, W., Yang, Y., Zhuang, Y. & Nie, F. 2010, 'Classification by semi-supervised discriminative regularization', Neurocomputing, vol. 73, no. 10-12, pp. 1641-1651.
View/Download from: Publisher's site
Linear discriminant analysis (LDA) is a well-known dimensionality reduction method which can be easily extended for data classification. Traditional LDA aims to preserve the separability of different classes and the compactness of the same class in the output space by maximizing the between-class covariance and simultaneously minimizing the within-class covariance. However, the performance of LDA usually deteriorates when labeled information is insufficient. In order to resolve this problem, semi-supervised learning can be used, among which manifold regularization (MR) provides an elegant framework to learn from labeled and unlabeled data. However, MR tends to misclassify data near the boundaries of different clusters during classification. In this paper, we propose a novel method, referred to as semi-supervised discriminative regularization (SSDR), to incorporate LDA and MR into a coherent framework for data classification, which exploits both label information and data distribution. Extensive experiments demonstrate the effectiveness of our proposed method in comparison with classical classification algorithms including SVM, LDA and MR. © 2010 Elsevier B.V.
Yang, Y., Zhuang, Y.T., Wu, F. & Pan, Y.H. 2008, 'Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval', IEEE Transactions on Multimedia, vol. 10, no. 3, pp. 437-446.
View/Download from: Publisher's site
In this paper, we consider the problem of multimedia document (MMD) semantics understanding and content-based cross-media retrieval. An MMD is a set of media objects of different modalities but carrying the same semantics, and content-based cross-media retrieval is a new kind of retrieval method by which the query examples and search results can be of different modalities. Two levels of manifolds are learned to explore the relationships among all the data at the level of MMD and at the level of media object, respectively. We first construct a Laplacian media object space for media object representation of each modality and an MMD semantic graph to learn the MMD semantic correlations. The characteristics of media objects propagate along the MMD semantic graph and an MMD semantic space is constructed to perform cross-media retrieval. Different methods are proposed to utilize relevance feedback, and experiments show that the proposed approaches are effective. © 2006 IEEE.
Zhuang, Y.T., Yang, Y. & Wu, F. 2008, 'Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval', IEEE Transactions on Multimedia, vol. 10, no. 2, pp. 221-229.
View/Download from: Publisher's site
Although multimedia objects such as images, audios and texts are of different modalities, there is a great amount of semantic correlation among them. In this paper, we propose a method of transductive learning to mine the semantic correlations among media objects of different modalities so as to achieve cross-media retrieval. Cross-media retrieval is a new kind of searching technology by which the query examples and the returned results can be of different modalities, e.g., to query images by an example of audio. First, according to the media objects' features and their co-existence information, we construct a uniform cross-media correlation graph, in which media objects of different modalities are represented uniformly. To perform the cross-media retrieval, a positive score is assigned to the query example; the score spreads along the graph, and media objects of the target modality or MMDs with the highest scores are returned. To boost the retrieval performance, we also propose different approaches of long-term and short-term relevance feedback to mine the information contained in the positive and negative examples. © 2008 IEEE.
Zhuang, Y., Yang, Y., Wu, F. & Pan, Y. 2007, 'Manifold learning based cross-media retrieval: A solution to media object complementary nature', Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 46, no. 2-3, pp. 153-164.
View/Download from: Publisher's site
Media objects of different modalities always exist jointly and are naturally complementary to each other, either in the view of semantics or in the view of modality. In this paper, we propose a manifold learning based cross-media retrieval approach that gives solutions to two intrinsically basic but crucial questions of media object semantics understanding and cross-media retrieval. First, considering the semantic complementarity, how can we represent the concurrent media objects and fuse the complementary information they carry to understand the integrated semantics precisely? Second, considering the modality complementarity, how can we accomplish the modality bridge to establish the cross-index and facilitate cross-media retrieval? To solve the two problems, we first construct a Multimedia Document (MMD) Semi-Semantic Graph (MMDSSG) and then adopt Multidimensional Scaling to create an MMD Semantic Space (MMDSS). Both long-term and short-term feedback are proposed to boost the system performance. The first is used to refine the MMDSSG and the second is adopted to introduce new items that are not in the training set into the MMDSS. Since all of the MMDs and their component media objects of different modalities lie in the MMDSS and are indexed uniformly by their coordinates in the MMDSS regardless of their modalities, the semantic subspace is actually a bridge between media objects of different modalities, and cross-media retrieval can be easily achieved. Experimental results are encouraging and indicate that the proposed approach is effective. © Springer Science+Business Media, LLC 2007.
Chang, X., Nie, F., Ma, Z. & Yang, Y., 'Balanced k-Means and Min-Cut Clustering'.
Clustering is an effective technique in data mining for generating groups that are of interest. Among various clustering approaches, the families of k-means and min-cut algorithms are the most popular due to their simplicity and efficacy. The classical k-means algorithm partitions a number of data points into several subsets by iteratively updating the cluster centers and the associated data points. By contrast, min-cut algorithms construct a weighted undirected graph and partition its vertices into two sets. However, existing clustering algorithms tend to assign only a minority of data points to some subsets, which should be avoided when the target dataset is balanced. To achieve more accurate clustering of balanced datasets, we propose to leverage the exclusive lasso on k-means and min-cut to regulate the balance degree of the clustering results. By optimizing our objective functions, built atop the exclusive lasso, we can make the clustering results as balanced as possible. Extensive experiments on several large-scale datasets validate the advantage of the proposed algorithms compared to state-of-the-art clustering algorithms.
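For intuition, here is a toy balanced k-means in the spirit of the exclusive-lasso idea: under an exclusive lasso on the cluster indicator matrix, the marginal cost of adding a point to a cluster grows with that cluster's current size, so oversized clusters are penalised. This sketch is not the authors' exact objective or optimisation.

```python
import numpy as np

def balanced_kmeans(X, k, lam=1.0, iters=20, seed=0):
    """Toy k-means with a cluster-size penalty (illustration only)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        sizes = np.bincount(labels, minlength=k).astype(float)
        for i, x in enumerate(X):
            sizes[labels[i]] -= 1.0  # take the point out of its cluster
            # Squared distance to each centre plus a penalty that grows
            # with the receiving cluster's size -> balanced assignments.
            cost = ((centers - x) ** 2).sum(axis=1) + lam * sizes
            labels[i] = int(np.argmin(cost))
            sizes[labels[i]] += 1.0
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```

Setting lam = 0 recovers a plain (sequentially assigned) k-means; increasing lam trades within-cluster compactness for more even cluster sizes.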
Chang, X., Nie, F., Yang, Y. & Huang, H., 'Improved Spectral Clustering via Embedded Label Propagation'.
Spectral clustering is a key research topic in machine learning and data mining. Most existing spectral clustering algorithms are built upon Gaussian Laplacian matrices, which are sensitive to parameters. We propose a novel parameter-free, distance-consistent Locally Linear Embedding (LLE). The proposed distance-consistent LLE guarantees that edges between closer data points have greater weight. Furthermore, we propose a novel improved spectral clustering algorithm via embedded label propagation. Our algorithm builds upon two advancements of the state of the art: 1) label propagation, which propagates a node's labels to neighboring nodes according to their proximity; and 2) manifold learning, which has been widely used for its capacity to leverage the manifold structure of data points. First, we perform standard spectral clustering on the original data and assign each cluster label to the k nearest data points. Next, we propagate labels through dense, unlabeled data regions. Extensive experiments on various datasets validate the superiority of the proposed algorithm compared to current state-of-the-art spectral clustering algorithms.
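A loose two-stage sketch of the pipeline the abstract outlines, built from stock scikit-learn components rather than the authors' own graph construction (the seed-selection rule and all parameter values are assumptions):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.semi_supervised import LabelSpreading

def cluster_with_propagation(X, n_clusters, seeds_per_cluster=5):
    # Stage 1: ordinary spectral clustering over all points.
    coarse = SpectralClustering(n_clusters=n_clusters,
                                affinity='nearest_neighbors',
                                random_state=0).fit_predict(X)
    # Keep only the points nearest each cluster mean as trusted seeds;
    # -1 marks everything else as unlabelled for the propagation step.
    y = np.full(len(X), -1)
    for c in range(n_clusters):
        idx = np.where(coarse == c)[0]
        centre = X[idx].mean(axis=0)
        d = np.linalg.norm(X[idx] - centre, axis=1)
        y[idx[np.argsort(d)[:seeds_per_cluster]]] = c
    # Stage 2: propagate the seed labels through dense data regions.
    return LabelSpreading(kernel='knn', n_neighbors=7).fit(X, y).transduction_
```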
Zheng, L., Zhang, H., Sun, S., Chandraker, M., Yang, Y. & Tian, Q., 'Person Re-identification in the Wild'.
We present a novel large-scale dataset and comprehensive baselines for end-to-end pedestrian detection and person recognition in raw video frames. Our baselines address three issues: the performance of various combinations of detectors and recognizers; mechanisms by which pedestrian detection can help improve overall re-identification accuracy; and the effectiveness of different detectors for re-identification. We make three distinct contributions. First, a new dataset, PRW, is introduced to evaluate Person Re-identification in the Wild, using videos acquired through six synchronized cameras. It contains 932 identities and 11,816 frames in which pedestrians are annotated with their bounding-box positions and identities. Extensive benchmarking results are presented on this dataset. Second, we show that pedestrian detection aids re-identification through two simple yet effective improvements: a discriminatively trained ID-discriminative Embedding (IDE) in the person subspace using convolutional neural network (CNN) features, and a Confidence Weighted Similarity (CWS) metric that incorporates detection scores into the similarity measurement. Third, we derive insights into evaluating detector performance for the particular scenario of accurate person re-identification.
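The Confidence Weighted Similarity described above is straightforward to illustrate: fold the detector's confidence into the matching score so unreliable bounding boxes rank lower. The sketch below assumes L2-normalised IDE features and a simple multiplicative weighting, which may differ from the paper's exact formulation.

```python
import numpy as np

def cws_scores(query_feat, gallery_feats, det_scores):
    """Confidence Weighted Similarity (illustrative sketch).

    query_feat    : (d,) L2-normalised IDE feature of the query person.
    gallery_feats : (n, d) L2-normalised IDE features of detected boxes.
    det_scores    : (n,) detection confidences in [0, 1].
    """
    cosine = gallery_feats @ query_feat  # similarity in the person subspace
    return cosine * det_scores           # down-weight shaky detections
```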