
Associate Professor Ivor Tsang

Biography

Ivor W. Tsang is a Future Fellow and Associate Professor with the Centre for Quantum Computation & Intelligent Systems (QCIS) at the University of Technology Sydney. His research focuses on transfer learning, feature selection, big data analytics for data with trillions of dimensions, and their applications to computer vision and pattern recognition. He has published more than 100 research papers in top-tier journals and conferences, and his h-index is 39. In 2009, Dr Tsang was conferred the 2008 Natural Science Award (Class II) by the Ministry of Education, China, in recognition of his contributions to kernel methods. In 2013, he received the prestigious Australian Research Council Future Fellowship for his research on machine learning on big data. He has also received the prestigious IEEE Transactions on Neural Networks Outstanding 2004 Paper Award (bestowed in 2007), the 2014 IEEE Transactions on Multimedia Prize Paper Award, and a number of best paper awards and honours from reputable international conferences, including the Best Student Paper Award at CVPR 2010 and the Best Paper Award at ICTAI 2011. He was also awarded the ECCV 2012 Outstanding Reviewer Award.

Professional

Services:
  • Area Chair: NIPS 2015
  • Senior Program Committee: ACML 2011, IJCAI 2011, IJCAI 2013, IJCAI 2015, AAAI 2016
  • Local Co-Chair: ACML 2012

Awards and Honorable Mentions: 

  • 2014 IEEE Transactions on Multimedia Prize Paper Award
  • Australian Research Council Future Fellowship 2013
  • The 12th European Conference on Computer Vision (ECCV 2012) Outstanding Reviewer Award
  • The Best Poster Award Honorable Mention at the Asian Conference on Machine Learning (ACML 2012)
  • The Best Student Paper Nomination at the IEEE Congress on Evolutionary Computation (CEC 2012)
  • The Best Paper Award at the 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2011)
  • The Best Student Paper Award at the 23rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010)
  • The 2008 Natural Science Award (Class II) by Ministry of Education, China (bestowed in 2009)
  • IEEE Transactions on Neural Networks Outstanding 2004 Paper Award (bestowed in 2007)
  • The Best Paper Award from the IEEE Hong Kong Chapter of Signal Processing Postgraduate Forum, 2006.
  • Microsoft Fellowship 2005
Future Fellow and Associate Professor, Research Centre for Quantum Computation & Intelligent Systems
Core Member, QCIS - Quantum Computation and Intelligent Systems
 
Phone
+61 2 9514 1885

Research Interests

Big Data Analytics and Scalable Machine Learning Algorithms:

  • Core Vector Machines (CVMs) for Classification, Regression, Semi-supervised Learning and Dimension Reduction on Big Data
  • Ball Vector Machines (BVMs) for Classification on Bigger Data
  • Matching Pursuit LASSO (MPL) for Sparse Recovery over Big Dictionaries with Millions of Atoms and Sparse Recovery over Batch Signals
  • Convex Matching Pursuit (CMP) for Large-Scale Sparse Coding and Subset Selection
  • Riemannian Pursuit for Big Matrix Recovery
  • Hierarchical Maximum Margin Learning for Fast Many-Class Prediction
  • Multi-Template Learning (MTL) of Sparse Feature Templates for Structured Prediction
  • Laplacian Embedded Support Vector Regression (LapESVR) for Scalable Manifold Regularization on a Big Graph
  • Simple Non-Parametric Kernel Learning (SimpleNPKL) for Large-Scale Semi-Definite Programming, Data Embedding and Visualization

Feature Selection for Big Dimensionality:

  • Feature Generating Machine (FGM) for Out-of-Core Feature Selection on Big Data with Trillions of Dimensions
  • Group Discovery Machine (GDM) for Feature Selection and Grouping on Big Data with Trillions of Correlations
  • Feature Selection with Performance Optimization (FSperf)
  • Feature Disentangling Machine (FDM) for Sparse Multi-Task Feature Selection
  • Sparse Perceptron Decision Tree (SPDT) for Millions of Dimensions
  • Learning Sparse Confidence-Weighted Classifiers on Very High Dimensional Data with Millions of Dimensions
  • SPARse Graph Embedding (SPARGE), a framework unifying ultrahigh-dimensional Feature Selection and Graph Embedding based Feature Reduction, such as sparse PCA, sparse LDA, sparse LPP and sparse SDA
  • Feature Weighting via Optimal Thresholding (FWOT) for Video Analysis

Transfer Learning and Domain Adaptation:

  • Transfer Component Analysis (TCA)
  • Domain Adaptation Machine (DAM)
  • Domain Transfer Multiple Kernel Learning (DTMKL)
  • Adaptive Multiple Kernel Learning (A-MKL)
  • Transfer Ordinal Label Learning (TOLL) for Ordinal Regression
  • Maximum Margin Target Label Learning (MMTLL)
  • Sparse Heterogeneous Feature Representation (SHFR) for Multiclass Heterogeneous Transfer Learning
  • Heterogeneous Feature Augmentation (HFA)
  • Hybrid Heterogeneous Transfer Learning (HHTL)
  • Maximum Penalized Likelihood Kernel Regression (MPLKR) for Fast Speaker Adaptation
  • Optimizing Performance Measures for Transfer Learning
  • Transfer Learning for Clustering, Stochastic Optimization, Ensemble Selection, etc.
  • Memetic Search with Inter-Domain Learning

Learning from Ambiguity:

  • Weakly Labeled Support Vector Machines (WellSVMs) via Label Generation
  • Input Output Kernel Learning (IOKL)
  • Co-labeling for Multi-View Weakly Labeled Learning
  • Maximum Entropy Discrimination with Feature Concatenation (MEDFC) for Learning with Missing Views and Missing Labels
  • RObust Semi-Supervised Ensemble Learning through label aggregation (ROSSEL) for Large-Scale SSL
  • Transductive Ordinal Regression

Sparse Representation and Dictionary Learning:

  • Laplacian Sparse Coding
  • Kernel Sparse Representation
  • Category-Specific Dictionary Learning for Fine-Grained Categorization
  • Efficient Multi-Layer Group Sparse Coding for Concurrent Image Classification and Annotation
  • Weighted Sparse Coding

Can supervise: Yes

Chapters

Nie, F., Xu, D., Tsang, I.W. & Zhang, C. 2013, 'A flexible and effective linearization method for subspace learning' in Graph Embedding for Pattern Analysis, pp. 177-203.
View/Download from: Publisher's site
© Springer Science+Business Media New York 2013. In the past decades, a large number of subspace learning or dimension reduction methods [2,16,20,32,34,37,44] have been proposed. Principal component analysis (PCA) [32] pursues the directions of maximum variance for optimal reconstruction. Linear discriminant analysis (LDA) [2], as a supervised algorithm, aims to maximize the inter-class scatter and at the same time minimize the intra-class scatter. Due to its utilization of label information, LDA is experimentally reported to outperform PCA for face recognition when sufficient labeled face images are provided [2].
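To make the subspace-learning objectives above concrete, here is a minimal numpy sketch of PCA, the textbook method the chapter builds on (an illustration only, not code from the chapter; the data matrix X and target dimension k are assumed):

    import numpy as np

    def pca(X, k):
        """Project X (n_samples x n_features) onto its top-k directions of maximum variance."""
        Xc = X - X.mean(axis=0)               # centre the data
        C = (Xc.T @ Xc) / (len(Xc) - 1)       # sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
        W = eigvecs[:, ::-1][:, :k]           # top-k eigenvectors = principal directions
        return Xc @ W, W                      # projected data and projection matrix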

Conferences

Tsang, W., Liu, W., Deng, Z., Gong, X. & Jiang, F. 2015, 'Effectively Predicting Whether and When a Topic Will Become Prevalent in a Social Network', Online proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI, Austin, Texas, pp. 210-216.
Effective forecasting of future prevalent topics plays an important role in social network business development. It involves two challenging aspects: predicting whether a topic will become prevalent, and when. This cannot be directly handled by the existing algorithms in topic modeling, item recommendation and action forecasting. The classic forecasting framework based on time series models may be able to predict a hot topic when a series of periodical changes to user-addressed frequency occurs in a systematic way. However, the frequency of topics discussed by users often changes irregularly in social networks. In this paper, a generic probabilistic framework is proposed for hot topic prediction, and machine learning methods are explored to predict hot topic patterns. Two effective models, PreWHether and PreWHen, are introduced to predict whether and when a topic will become prevalent. In the PreWHether model, we simulate the constructed features of previously observed frequency changes for better prediction. In the PreWHen model, distributions of time intervals associated with the emergence to prevalence of a topic are modeled. Extensive experiments on real datasets demonstrate that our method outperforms the baselines and generates more effective predictions.
Yan, Y., Tan, M., Yang, Y., Tsang, I. & Zhang, C. 2015, 'Scalable maximum margin matrix factorization by active riemannian subspace search', Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), IJCAI International Joint Conference on Artificial Intelligence, AAAI, Buenos Aires, Argentina, pp. 3988-3994.
The user ratings in recommendation systems are usually in the form of ordinal discrete values. To give more accurate prediction of such rating data, maximum margin matrix factorization (M3F) was proposed. Existing M3F algorithms, however, either have massive computational cost or require expensive model selection procedures to determine the number of latent factors (i.e. the rank of the matrix to be recovered), making them less practical for large scale data sets. To address these two challenges, in this paper, we formulate M3F with a known number of latent factors as the Riemannian optimization problem on a fixed-rank matrix manifold and present a block-wise nonlinear Riemannian conjugate gradient method to solve it efficiently. We then apply a simple and efficient active subspace search scheme to automatically detect the number of latent factors. Empirical studies on both synthetic data sets and large real-world data sets demonstrate the superior efficiency and effectiveness of the proposed method.
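As a rough illustration of fixed-rank matrix factorization for rating prediction, a minimal numpy sketch follows. Note it uses a plain squared loss and Euclidean gradient descent, not the max-margin loss or Riemannian conjugate gradient of the paper, and the rank k is given rather than detected by active subspace search:

    import numpy as np

    def factorize(R, mask, k, lr=0.01, steps=500, seed=0):
        """Fit R ~ U @ V.T on observed entries (mask == 1) by gradient descent."""
        rng = np.random.default_rng(seed)
        m, n = R.shape
        U = 0.1 * rng.standard_normal((m, k))
        V = 0.1 * rng.standard_normal((n, k))
        for _ in range(steps):
            E = mask * (U @ V.T - R)                      # residual on observed entries only
            U, V = U - lr * (E @ V), V - lr * (E.T @ U)   # simultaneous gradient step
        return U, V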
Wang, H., Zhang, P., Tsang, I., Chen, L. & Zhang, C. 2015, 'Defragging Subgraph Features for Graph Classification', Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, The 24th ACM International on Conference on Information and Knowledge Management (CIKM'15), ACM, Melbourne, VIC, Australia, pp. 1687-1690.
View/Download from: Publisher's site
Graph classification is an important tool for analysing structured and semi-structured data, where subgraphs are commonly used as the feature representation. However, the number and size of subgraph features crucially depend on the threshold parameters of frequent subgraph mining algorithms. Any improper setting of the parameters will generate many trivial short-pattern subgraph fragments which dominate the feature space, distort graph classifiers and bury interesting long-pattern subgraphs. In this paper, we propose a new Subgraph Join Feature Selection (SJFS) algorithm. The SJFS algorithm, by forcing graph classifiers to join short-pattern subgraph fragments, can defrag trivial subgraph features and deliver long-pattern interesting subgraphs. Experimental results on both synthetic and real-world social network graph data demonstrate the performance of the proposed method.
Chen, M., Velasco-Forero, S., Tsang, I. & Cham, T.J. 2015, 'Objects co-segmentation: Propagated from simpler images', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 1682-1686.
View/Download from: Publisher's site
© 2015 IEEE. Recent works on image co-segmentation aim to segment common objects among image sets. These methods can co-segment simple images well, but their performance may degrade significantly on more cluttered images. In order to co-segment both simple and complex images well, this paper proposes a novel paradigm to rank images and to propagate the segmentation results from the simple images to more and more complex ones. In the experiments, the proposed paradigm demonstrates its effectiveness in segmenting large image sets with a wide variety in object appearance, sizes, orientations, poses, and multiple objects in one image. It outperformed the current state-of-the-art algorithms significantly, especially in difficult images.
Guntuku, S.C., Zhou, J.T., Roy, S., Weisi, L. & Tsang, I.W. 2014, 'Deep representations to model user 'Likes'', Computer Vision -- ACCV 2014 (LNCS), Asian Conference on Computer Vision, Springer, Singapore, Singapore, pp. 3-18.
View/Download from: Publisher's site
© Springer International Publishing Switzerland 2015. Automatically understanding and modeling a user's liking for an image is a challenging problem. This is because the relationship between the image's features (even semantic ones extracted by existing tools, viz. faces, objects etc.) and users' 'likes' is non-linear, influenced by several subtle factors. This work presents a deep bi-modal knowledge representation of images based on their visual content and associated tags (text). A mapping step between the different levels of visual and textual representations allows for the transfer of semantic knowledge between the two modalities. It also includes feature selection before learning the deep representation to identify the important features for a user to like an image. The proposed representation is then shown to be effective in learning a model of users' image 'likes' based on a collection of images 'liked' by them. On a collection of images 'liked' by users (from Flickr), the proposed deep representation is shown to outperform state-of-the-art low-level features used for modeling user 'likes' by around 15-20%.
Liu, W. & Tsang, W. 2015, 'Large Margin Metric Learning for Multi-label Prediction', Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, Austin, Texas, USA, pp. 2800-2806.
Canonical correlation analysis (CCA) and maximum margin output coding (MMOC) methods have shown promising results for multi-label prediction, where each instance is associated with multiple labels. However, these methods require an expensive decoding procedure to recover the multiple labels of each testing instance. The testing complexity becomes unacceptable when there are many labels. To avoid decoding completely, we present a novel large margin metric learning paradigm for multi-label prediction. In particular, the proposed method learns a distance metric to discover label dependency such that instances with very different multiple labels will be moved far away. To handle many labels, we present an accelerated proximal gradient procedure to speed up the learning process. Comprehensive experiments demonstrate that our proposed method is significantly faster than CCA and MMOC in terms of both training and testing complexities. Moreover, our method achieves superior prediction performance compared with state-of-the-art methods.
Tan, M., Tsang, I., Wang, L., Vandereycken, B. & Pan, S.J. 2014, 'Riemannian Pursuit for Big Matrix Recovery', Journal of Machine Learning Research Workshop and Conference Proceedings, The 31st International Conference on Machine Learning, Microtome Publishing, Beijing, pp. 1539-1547.
Copyright © (2014) by the International Machine Learning Society (IMLS). All rights reserved. Low rank matrix recovery is a fundamental task in many real-world applications. The performance of existing methods, however, deteriorates significantly when applied to ill-conditioned or large-scale matrices. In this paper, we therefore propose an efficient method, called Riemannian Pursuit (RP), that aims to address these two problems simultaneously. Our method consists of a sequence of fixed-rank optimization problems. Each subproblem, solved by a nonlinear Riemannian conjugate gradient method, aims to correct the solution in the most important subspace of increasing size. Theoretically, RP converges linearly under mild conditions and experimental results show that it substantially outperforms existing methods when applied to large-scale and ill-conditioned matrices.
Liu, P., Zhou, T., Tsang, W., Meng, Z., Han, S. & Tong, Y. 2014, 'Feature Disentangling Machine - A Novel Approach of Feature Selection and Disentangling in Facial Expression Analysis', Computer Vision – ECCV 2014, ECCV 2014, Springer International Publishing, Zurich, Switzerland, pp. 151-166.
View/Download from: Publisher's site
Studies in psychology show that not all facial regions are of importance in recognizing facial expressions and different facial regions make different contributions in various facial expressions. Motivated by this, a novel framework, named Feature Disentangling Machine (FDM), is proposed to effectively select active features characterizing facial expressions. More importantly, the FDM aims to disentangle these selected features into non-overlapped groups, in particular, common features that are shared across different expressions and expression-specific features that are discriminative only for a target expression. Specifically, the FDM integrates sparse support vector machine and multi-task learning in a unified framework, where a novel loss function and a set of constraints are formulated to precisely control the sparsity and naturally disentangle active features. Extensive experiments on two well-known facial expression databases have demonstrated that the FDM outperforms the state-of-the-art methods for facial expression analysis. More importantly, the FDM achieves an impressive performance in a cross-database validation, which demonstrates the generalization capability of the selected features.
Xu, Z., Tsang, W., Yang, Y., Ma, Z. & Hauptmann, A.G. 2014, 'Event Detection using Multi-Level Relevance Labels and Multiple Features', 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, pp. 97-104.
View/Download from: Publisher's site
We address the challenging problem of utilizing related exemplars for complex event detection while multiple features are available. Related exemplars share certain positive elements of the event, but have no uniform pattern due to the huge variance of relevance levels among different related exemplars. None of the existing multiple feature fusion methods can deal with the related exemplars. In this paper, we propose an algorithm which adaptively utilizes the related exemplars by cross-feature learning. Ordinal labels are used to represent the multiple relevance levels of the related videos. Label candidates of related exemplars are generated by exploring the possible relevance levels of each related exemplar via a cross-feature voting strategy. Maximum margin criterion is then applied in our framework to discriminate the positive and negative exemplars, as well as the related exemplars from different relevance levels. We test our algorithm using the large scale TRECVID 2011 dataset and it gains promising performance.
Zhou, J.T., Pan, S.J., Tsang, W. & Yan, Y. 2014, 'Hybrid Heterogeneous Transfer Learning through Deep Learning', Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, Quebec City; Canada, pp. 2213-2219.
Nguyen, M.L., Tsang, W., Chai, K.M.A. & Chieu, H.L. 2014, 'Robust domain adaptation for relation extraction via clustering consistency', 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference, 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD; United States, pp. 807-817.
Xu, Z., Yang, Y., Tsang, I., Hauptmann, A. & Sebe, N. 2013, 'Feature Weighting via Optimal Thresholding for Video Analysis', 2013 IEEE International Conference on Computer Vision, International Conference on Computer Vision, IEEE, Sydney, pp. 3440-3447.
View/Download from: UTS OPUS or Publisher's site
Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks like TRECVID Multimedia Event Detection (MED) competition [1]. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT) to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on HMDB [8] action recognition dataset and CCV [5] video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.
Chen, L., Duan, L., Tsang, I.W. & Xu, D. 2013, 'Efficient discriminative learning of class hierarchy for many class prediction', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 274-288.
View/Download from: Publisher's site
Recently the maximum margin criterion has been employed to learn a discriminative class hierarchical model, which shows promising performance for rapid multi-class prediction. Specifically, at each node of this hierarchy, a separating hyperplane is learned to split its associated classes from all of the corresponding training data, leading to a time-consuming training process in computer vision applications with many classes such as large-scale object recognition and scene classification. To address this issue, in this paper we propose a new efficient discriminative class hierarchy learning approach for many class prediction. We first present a general objective function to unify the two state-of-the-art methods for multi-class tasks. When there are many classes, this objective function reveals that some classes are indeed redundant. Thus, omitting these redundant classes will not degrade the prediction performance of the learned class hierarchical model. Based on this observation, we decompose the original optimization problem into a sequence of much smaller sub-problems by developing an adaptive classifier updating method and an active class selection strategy. Specifically, we iteratively update the separating hyperplane by efficiently using the training samples only from a limited number of selected classes that are well separated by the current separating hyperplane. Comprehensive experiments on three large-scale datasets demonstrate that our approach can significantly accelerate the training process of the two state-of-the-art methods while achieving comparable prediction performance in terms of both classification accuracy and testing speed. © 2013 Springer-Verlag.
Zhai, Y.Z., Tan, M., Tsang, I. & Ong, Y. 2012, 'Discovering Support and Affiliated Features from Very High Dimensions', Proceedings of the 29th International Conference on Machine Learning, International Conference on Machine Learning, Omnipress, Edinburgh, Scotland, pp. 1455-1462.
View/Download from: UTS OPUS
In this paper, a novel learning paradigm is presented to automatically identify groups of informative and correlated features from very high dimensions. Specifically, we explicitly incorporate correlation measures as constraints and then propose an efficient embedded feature selection method using the recently developed cutting plane strategy. The benefits of the proposed algorithm are twofold. First, it can identify the optimal discriminative and uncorrelated feature subset for the output labels, denoted here as Support Features, which brings about significant improvements in prediction performance over other state-of-the-art feature selection methods considered in the paper. Second, during the learning process, the underlying group structures of correlated features associated with each support feature, denoted as Affiliated Features, can also be discovered without any additional cost. These affiliated features serve to improve the interpretability of the learning tasks. Extensive empirical studies on both synthetic and very high dimensional real-world datasets verify the validity and efficiency of the proposed method.
Duan, L., Xu, D. & Tsang, I. 2012, 'Learning with Augmented Features for Heterogeneous Domain Adaptation', Proceedings of the 29th International Conference on Machine Learning, Omnipress, Edinburgh, Scotland, pp. 711-718.
View/Download from: UTS OPUS
We propose a new learning method for heterogeneous domain adaptation (HDA), in which the data from the source domain and the target domain are represented by heterogeneous features with different dimensions. Using two different projection matrices, we first transform the data from two domains into a common subspace in order to measure the similarity between the data from two domains. We then propose two new feature mapping functions to augment the transformed data with their original features and zeros. The existing learning methods (e.g., SVM and SVR) can be readily incorporated with our newly proposed augmented feature representations to effectively utilize the data from both domains for HDA. Using the hinge loss function in SVM as an example, we introduce the detailed objective function in our method called Heterogeneous Feature Augmentation (HFA) for a linear case and also describe its kernelization in order to efficiently cope with the data with very high dimensions. Moreover, we also develop an alternating optimization algorithm to effectively solve the nontrivial optimization problem in our HFA method. Comprehensive experiments on two benchmark datasets clearly demonstrate that HFA outperforms the existing HDA methods.
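The augmented feature maps described above are easy to state concretely. A minimal sketch, assuming the projection matrices P and Q have already been learned (in HFA they are optimized jointly with the classifier; here they are just given arrays):

    import numpy as np

    def augment_source(x, P, d_t):
        """Source sample x -> [P @ x ; x ; zeros(d_t)]."""
        return np.concatenate([P @ x, x, np.zeros(d_t)])

    def augment_target(z, Q, d_s):
        """Target sample z -> [Q @ z ; zeros(d_s) ; z]."""
        return np.concatenate([Q @ z, np.zeros(d_s), z])

After augmentation, samples from both domains live in one common space (common subspace plus both original dimensionalities), so a standard SVM can be trained on the pooled data.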
Xiang, Q., Mao, Q., Chai, K.M., Chieu, H.L., Tsang, I. & Zhao, Z. 2012, 'A Split-Merge Framework for Comparing Clusterings', Proceedings of the 29th International Conference on Machine Learning, International Conference on Machine Learning, Omnipress, Edinburgh, Scotland, pp. 1055-1062.
View/Download from: UTS OPUS
Clustering evaluation measures are frequently used to evaluate the performance of algorithms. However, most measures are not properly normalized and ignore some information in the inherent structure of clusterings. We model the relation between two clusterings as a bipartite graph and propose a general component-based decomposition formula based on the components of the graph. Most existing measures are examples of this formula. In order to satisfy consistency in the component, we further propose a split-merge framework for comparing clusterings of different data sets. Our framework gives measures that are conditionally normalized, and it can make use of data point information, such as feature vectors and pairwise distances. We use an entropy-based instance of the framework and a coreference resolution data set to demonstrate empirically the utility of our framework over other measures.
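As a concrete example of the bipartite-graph view above, the sketch below builds the contingency matrix of two clusterings and evaluates normalized mutual information, one classic measure in the family the paper generalizes (a standard baseline, not the split-merge framework itself; labels are assumed to be integers starting at 0):

    import numpy as np

    def contingency(a, b):
        """Entry (i, j) counts the points shared by cluster i of `a` and cluster j of `b`."""
        M = np.zeros((a.max() + 1, b.max() + 1))
        for x, y in zip(a, b):
            M[x, y] += 1
        return M

    def nmi(a, b):
        """Normalized mutual information between two clusterings."""
        pij = contingency(a, b) / len(a)
        pi, pj = pij.sum(axis=1), pij.sum(axis=0)
        nz = pij > 0
        mi = (pij[nz] * np.log(pij[nz] / np.outer(pi, pj)[nz])).sum()
        hi = -(pi[pi > 0] * np.log(pi[pi > 0])).sum()
        hj = -(pj[pj > 0] * np.log(pj[pj > 0])).sum()
        return mi / np.sqrt(hi * hj)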
Xu, X., Tsang, I. & Xu, D. 2012, 'Handling Ambiguity via Input-Output Kernel Learning', 2012 IEEE 12th International Conference on Data Mining (ICDM), IEEE International Conference on Data Mining, IEEE, Brussels, Belgium, pp. 725-734.
View/Download from: UTS OPUS
Data ambiguities exist in many data mining and machine learning applications such as text categorization and image retrieval. For instance, it is generally beneficial to utilize the ambiguous unlabeled documents to learn a more robust classifier for text categorization under the semi-supervised learning setting. To handle general data ambiguities, we present a unified kernel learning framework named Input-Output Kernel Learning (IOKL). Based on our framework, we further propose a novel soft margin group sparse Multiple Kernel Learning (MKL) formulation by introducing a group kernel slack variable to each group of base input-output kernels. Moreover, an efficient block-wise coordinate descent algorithm with an analytical solution for the kernel combination coefficients is developed to solve the proposed formulation. We conduct comprehensive experiments on benchmark datasets for both semi-supervised learning and multiple instance learning tasks, and also apply our IOKL framework to a computer vision application called text-based image retrieval on the NUS-WIDE dataset. Promising results demonstrate the effectiveness of our proposed IOKL framework.
Li, W., Duan, L., Tsang, I. & Xu, D. 2012, 'Co-Labeling: A New Multi-view Learning Approach for Ambiguous Problems', 2012 IEEE 12th International Conference on Data Mining (ICDM), IEEE International Conference on Data Mining, IEEE, Brussels, Belgium, pp. 419-428.
View/Download from: UTS OPUS
We propose a multi-view learning approach called co-labeling which is applicable for several machine learning problems where the labels of training samples are uncertain, including semi-supervised learning (SSL), multi-instance learning (MIL) and max-margin clustering (MMC). Particularly, we first unify those problems into a general ambiguous problem in which we simultaneously learn a robust classifier as well as find the optimal training labels from a finite label candidate set. To effectively utilize multiple views of data, we then develop our co-labeling approach for the general multi-view ambiguous problem. In our work, classifiers trained on different views can teach each other by iteratively passing the predictions of training samples from one classifier to the others. The predictions from one classifier are considered as label candidates for the other classifiers. To train a classifier with a label candidate set for each view, we adopt the Multiple Kernel Learning (MKL) technique by constructing the base kernel through associating the input kernel calculated from input features with one label candidate. Compared with the traditional co-training method which was specifically designed for SSL, the advantages of our co-labeling are two-fold: 1) it can be applied to other ambiguous problems such as MIL and MMC; 2) it is more robust by using the MKL method to integrate multiple labeling candidates obtained from different iterations and biases. Promising results on several real-world multi-view data sets clearly demonstrate the effectiveness of our proposed co-labeling for both MIL and SSL.
Seah, C.W., Tsang, I., Ong, Y. & Mao, Q. 2012, 'Learning Target Predictive Function without Target Labels.', 2012 IEEE 12th International Conference on Data Mining (ICDM), IEEE International Conference on Data Mining, IEEE, Brussels, Belgium, pp. 1098-1103.
View/Download from: UTS OPUS
In the absence of the labeled samples in a domain referred to as target domain, Domain Adaptation (DA) techniques come in handy. Generally, DA techniques assume there are available source domains that share similar predictive function with the target domain. Two core challenges of DA typically arise, variance that exists between source and target domains, and the inherent source hypothesis bias. In this paper, we first propose a Stability Transfer criterion for selecting relevant source domains with low variance. With this criterion, we introduce a TARget learning Assisted by Source Classifier Adaptation (TARASCA) method to address the two core challenges that have impeded the performances of DA techniques. To verify the robustness of TARASCA, extensive experimental studies are carried out with comparison to several state-of-the-art DA methods on the real-world Sentiment and Newsgroups datasets, where various settings for the class ratios of the source and target domains are considered.
Li, W., Duan, L., Tsang, I. & Xu, D. 2012, 'Batch Mode Adaptive Multiple Instance Learning for Computer Vision Tasks', IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, Providence, Rhode Island.
View/Download from: UTS OPUS
Tan, M., Tsang, W., Wang, L. & Zhang, X. 2012, 'Convex Matching Pursuit for Large-scale Sparse Coding and Subset Selection', Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Toronto, Canada, pp. 1119-1125.
View/Download from: UTS OPUS
In this paper, a new convex matching pursuit scheme is proposed for tackling large-scale sparse coding and subset selection problems. In contrast with current matching pursuit algorithms such as subspace pursuit (SP), the proposed algorithm has a convex formulation and guarantees that the objective value can be monotonically decreased. Moreover, theoretical analysis and experimental results show that the proposed method achieves better scalability while maintaining similar or better decoding ability compared with state-of-the-art methods on large-scale problems.
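For readers unfamiliar with matching pursuit, the classic greedy variant that CMP improves on fits a signal by repeatedly picking the dictionary atom most correlated with the current residual. A minimal sketch of that baseline (not the convex formulation of the paper; D is assumed to have unit-norm columns):

    import numpy as np

    def matching_pursuit(D, y, n_iters):
        """Greedily approximate signal y as a sparse combination of columns of D."""
        residual, coef = y.astype(float).copy(), np.zeros(D.shape[1])
        for _ in range(n_iters):
            j = np.argmax(np.abs(D.T @ residual))   # atom most correlated with residual
            c = D[:, j] @ residual
            coef[j] += c
            residual -= c * D[:, j]                 # remove that atom's contribution
        return coef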
Yang, J.B., Mao, Q., Xiang, Q.L., Tsang, I.W., Chai, K.M.A. & Chieu, H.L. 2012, 'Domain adaptation for coreference resolution: An adaptive ensemble approach', EMNLP-CoNLL 2012 - 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Proceedings of the Conference, pp. 744-753.
We propose an adaptive ensemble method to adapt coreference resolution across domains. This method has three features: (1) it can optimize for any user-specified objective measure; (2) it can make document-specific prediction rather than rely on a fixed base model or a fixed set of base models; (3) it can automatically adjust the active ensemble members during prediction. With simplification, this method can be used in the traditional within-domain case, while still retaining the above features. To the best of our knowledge, this work is the first to both (i) develop a domain adaptation algorithm for the coreference resolution problem and (ii) have the above features as an ensemble method. Empirically, we show the benefits of (i) on the six domains of the ACE 2005 data set in domain adaptation setting, and of (ii) on both the MUC-6 and the ACE 2005 data sets in within-domain setting. © 2012 Association for Computational Linguistics.
Zhou, J.T., Pan, S.J., Mao, Q. & Tsang, I.W. 2012, 'Multi-view positive and unlabeled learning', Journal of Machine Learning Research, pp. 555-570.
Learning with Positive and Unlabeled instances (PU learning) arises widely in information retrieval applications. To address the unavailability issue of negative instances, most existing PU learning approaches require to either identify a reliable set of negative instances from the unlabeled data or estimate probability densities as an intermediate step. However, inaccurate negative-instance identification or poor density estimation may severely degrade overall performance of the final predictive model. To this end, we propose a novel PU learning method based on density ratio estimation without constructing any sets of negative instances or estimating any intermediate densities. To further boost PU learning performance, we extend our proposed learning method in a multi-view manner by utilizing multiple heterogeneous sources. Extensive experimental studies demonstrate the effectiveness of our proposed methods, especially when positive labeled data are limited. © 2012 J.T. Zhou, S.J. Pan, Q. Mao & I.W. Tsang.
Seah, C.W., Ong, Y.S., Tsang, I.W. & Jiang, S. 2012, 'Pareto rank learning in multi-objective evolutionary algorithms', 2012 IEEE Congress on Evolutionary Computation, CEC 2012.
View/Download from: Publisher's site
In this paper, the interest is on cases where assessing the goodness of a solution for the problem is costly or hazardous to construct or extremely computationally intensive to compute. We label such category of problems as "expensive" in the present study. In the context of multi-objective evolutionary optimizations, the challenge amplifies, since multiple criteria assessments, each defined by an "expensive" objective is necessary and it is desirable to obtain the Pareto-optimal solution set under a limited resource budget. To address this issue, we propose a Pareto Rank Learning scheme that predicts the Pareto front rank of the offspring in MOEAs, in place of the "expensive" objectives when assessing the population of solutions. Experimental study on 19 standard multi-objective benchmark test problems concludes that Pareto rank learning enhanced MOEA led to significant speedup over the state-of-the-art NSGA-II, MOEA/D and SPEA2. © 2012 IEEE.
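The quantity being predicted above, the Pareto-front rank of a solution, can be computed exactly (if expensively) by iterative non-dominated sorting. A minimal sketch of that target computation, assuming all objectives are minimized (the learner that replaces the "expensive" objectives is not shown):

    import numpy as np

    def pareto_ranks(F):
        """Rank each row of F (n solutions x m objectives): 0 = first Pareto front, etc."""
        n = len(F)
        ranks = np.full(n, -1)
        rank = 0
        while (ranks < 0).any():
            alive = np.where(ranks < 0)[0]
            for i in alive:
                dominated = any(
                    np.all(F[j] <= F[i]) and np.any(F[j] < F[i]) for j in alive if j != i
                )
                if not dominated:
                    ranks[i] = rank   # i sits on the current front
            rank += 1
        return ranks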
Feng, L., Ong, Y.S., Tsang, I.W.H. & Tan, A.H. 2012, 'An evolutionary search paradigm that learns with past experiences', 2012 IEEE Congress on Evolutionary Computation, CEC 2012.
View/Download from: Publisher's site
A major drawback of evolutionary optimization approaches in the literature is the apparent lack of automated knowledge transfers and reuse across problems. Particularly, evolutionary optimization methods generally start a search from scratch or ground zero state, independent of how similar the given new problem of interest is to those optimized previously. In this paper, we present a study on the transfer of knowledge in the form of useful structured knowledge or latent patterns that are captured from previous experiences of problem-solving to enhance future evolutionary search. The essential contributions of our present study include the meme learning and meme selection processes. In contrast to existing methods, which directly store and reuse specific problem solutions or problem sub-components, the proposed approach models the structured knowledge of the strategy behind solving problems belonging to similar domain, i.e., via learning the mapping from problem to its corresponding solution, which is encoded in the form of identified knowledge representation. In this manner, knowledge transfer can be conducted across problems, from differing problem size, structure to representation, etc. A demonstrating case study on the capacitated arc routing problem (CARP) is presented. Experiments on benchmark instances of CARP verified the effectiveness of the proposed new paradigm. © 2012 IEEE.
Li, W., Duan, L., Xu, D. & Tsang, I. 2011, 'Text-based Image Retrieval using Progressive Multi-Instance Learning', 2011 IEEE International Conference on Computer Vision, IEEE International Conference on Computer Vision, IEEE, Barcelona, Spain, pp. 2049-2055.
View/Download from: UTS OPUS
Relevant and irrelevant images collected from the Web (e.g., Flickr.com) have been employed as loosely labeled training data for image categorization and retrieval. In this work, we propose a new approach to learn a robust classifier for text-based image retrieval (TBIR) using relevant and irrelevant training web images, in which we explicitly handle noise in the loose labels of training images. Specifically, we first partition the relevant and irrelevant training web images into clusters. By treating each cluster as a "bag" and the images in each bag as "instances ", we formulate this task as a multi-instance learning problem with constrained positive bags, in which each positive bag contains at least a portion of positive instances. We present a new algorithm called MIL-CPB to effectively exploit such constraints on positive bags and predict the labels of test instances (images). Observing that the constraints on positive bags may not always be satisfied in our application, we additionally propose a progressive scheme (referred to as Progressive MIL-CPB, or PMIL-CPB) to further improve the retrieval performance, in which we iteratively partition the top-ranked training web images from the current MILCPB classifier to construct more confident positive "bags" and then add these new "bags" as training data to learn the subsequent MIL-CPB classifiers. Comprehensive experiments on two challenging real-world web image data sets demonstrate the effectiveness of our approach.
Seah, C.W., Tsang, I. & Ong, Y. 2011, 'Healing Sample Selection Bias by Source Classifier Selection', 2011 IEEE 11th International Conference on Data Mining (ICDM), IEEE International Conference on Data Mining, IEEE, Vancouver, Canada, pp. 577-586.
View/Download from: UTS OPUS
Domain Adaptation (DA) methods are usually carried out by means of simply reducing the marginal distribution differences between the source and target domains, and subsequently using the resultant trained classifier, namely source classifier, for use in the target domain. However, in many cases, the true predictive distributions of the source and target domains can be vastly different especially when their class distributions are skewed, causing the issues of sample selection bias in DA. Hence, DA methods which leverage the source labeled data may suffer from poor generalization in the target domain, resulting in negative transfer. In addition, we observed that many DA methods use either a source classifier or a linear combination of source classifiers with a fixed weighting for predicting the target unlabeled data. Essentially, the labels of the target unlabeled data are spanned by the prediction of these source classifiers. Motivated by these observations, in this paper, we propose to construct many source classifiers of diverse biases and learn the weight for each source classifier by directly minimizing the structural risk defined on the target unlabeled data so as to heal the possible sample selection bias. Since the weights are learned by maximizing the margin of separation between opposite classes on the target unlabeled data, the proposed method is established here as Maximal Margin Target Label Learning (MMTLL), which is in a form of Multiple Kernel Learning problem with many label kernels. Extensive experimental studies of MMTLL against several state-of-the-art methods on the Sentiment and Newsgroups datasets with various imbalanced class settings showed that MMTLL exhibited robust accuracies on all the settings considered and was resilient to negative transfer, in contrast to other counterpart methods which suffered significantly in prediction accuracy.
Mao, Q. & Tsang, I. 2011, 'Optimizing Performance Measures for Feature Selection', 2011 IEEE 11th International Conference on Data Mining (ICDM), IEEE International Conference on Data Mining, IEEE, Vancouver, Canada, pp. 1170-1175.
View/Download from: UTS OPUS
Feature selection with specific multivariate performance measures is the key to the success of many applications, such as information retrieval and bioinformatics. The existing feature selection methods are usually designed for classification error. In this paper, we present a unified feature selection framework for general loss functions. In particular, we study the novel feature selection paradigm of optimizing multivariate performance measures. The resultant formulation is a challenging problem for high-dimensional data. Hence, a two-layer cutting plane algorithm is proposed to solve this problem, and its convergence is presented. Extensive experiments on large-scale and high-dimensional real-world datasets show that the proposed method outperforms l1-SVM and SVM-RFE when choosing a small subset of features, and achieves significantly improved performance over SVMperf in terms of F1-score.
Tsang, W. & Yang, J. 2011, 'Hierarchical Maximum Margin Learning for Multi-Class Classification', Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, Conference on Uncertainty in Artificial Intelligence, Corvallis, Oregon, pp. 753-760.
View/Download from: UTS OPUS
Due to myriads of classes, designing accurate and efficient classifiers becomes very challenging for multi-class classification. Recent research has shown that class structure learning can greatly facilitate multi-class learning. In this paper, we propose a novel method to learn the class structure for multi-class classification problems. The class structure is assumed to be a binary hierarchical tree. To learn such a tree, we propose a maximum separating margin method to determine the child nodes of any internal node. The proposed method ensures that two class groups represented by any two sibling nodes are most separable. In the experiments, we evaluate the accuracy and efficiency of the proposed method against other multi-class classification methods on real-world large-scale problems. The results show that the proposed method outperforms benchmark methods in terms of accuracy for most datasets and performs comparably with other class structure learning methods in terms of efficiency for all datasets.
Li, S. & Tsang, I.W. 2011, 'Learning to locate relative outliers', Journal of Machine Learning Research, pp. 47-62.
Outliers usually spread across regions of low density. However, due to the absence or scarcity of outliers, designing a robust detector to sift outliers from a given dataset is still very challenging. In this paper, we consider to identify relative outliers from the target dataset with respect to another reference dataset of normal data. Particularly, we employ Maximum Mean Discrepancy (MMD) for matching the distribution between these two datasets and present a novel learning framework to learn a relative outlier detector. The learning task is formulated as a Mixed Integer Programming (MIP) problem, which is computationally hard. To this end, we propose an effective procedure to find a largely violated labeling vector for identifying relative outliers from abundant normal patterns, and its convergence is also presented. Then, a set of largely violated labeling vectors are combined by multiple kernel learning methods to robustly locate relative outliers. Comprehensive empirical studies on real-world datasets verify that our proposed relative outlier detection outperforms existing methods. © 2011 S. Li & I.W. Tsang.
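The Maximum Mean Discrepancy used above for distribution matching has a simple empirical estimator. A minimal sketch with an RBF kernel (the MIP labeling-vector machinery of the paper is not shown; gamma is an assumed bandwidth parameter):

    import numpy as np

    def mmd2(X, Y, gamma=1.0):
        """Biased empirical estimate of the squared MMD between samples X and Y."""
        def k(A, B):                              # RBF kernel matrix between row sets
            sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
            return np.exp(-gamma * sq)
        return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()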
Li, S., Tsang, I.W. & Chaudhari, N.S. 2011, 'Infinite decision agent ensemble learning system for credit risk analysis', Proceedings - 10th International Conference on Machine Learning and Applications, ICMLA 2011, pp. 36-39.
View/Download from: Publisher's site
Considering the special needs of credit risk analysis, the Infinite DEcision Agent ensemble Learning (IDEAL) system is proposed. In the first level of our model, we adopt soft margin boosting to overcome overfitting. In the second level, the RVM algorithm is revised for boosting so that different RVM agents can be generated from the updated instance space of the data. In the third level, the perceptron kernel is employed in RVM to generate infinite subagents. Our IDEAL system also shares some good properties, such as good generalization performance, immunity to overfitting and predicting the distance to default. According to the experimental results, our proposed system can achieve better performance in terms of sensitivity, specificity and overall accuracy. © 2011 IEEE.
Li, S. & Tsang, I.W. 2011, 'Maximum margin/volume outlier detection', Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI, pp. 385-392.
View/Download from: Publisher's site
Due to the absence or scarcity of outliers, designing a robust outlier detector is very challenging. In this paper, we first propose to use the maximum margin criterion to sift unknown outliers from a given data set, which demonstrates superior performance in outlier detection. However, the resultant learning task is formulated as a Mixed Integer Programming (MIP) problem, which is computationally hard. To this end, we alter the recently developed label generating technique, which efficiently solves a convex relaxation of the MIP problem of outlier detection. Specifically, we propose an effective procedure to find a largely violated labeling vector for identifying rare outliers from abundant normal patterns, and its convergence is also presented. Then, a set of largely violated labeling vectors are combined by multiple kernel learning methods to robustly detect outliers. Besides these, to further enhance the efficacy of our outlier detector, we also explore the use of maximum volume criterion to measure the quality of separation between outliers and normal patterns. This criterion can be easily incorporated into our proposed framework by introducing an additional regularization. Comprehensive experiments on real-world data sets verify that the outlier detectors using the two proposed criteria outperform existing outlier detection methods. © 2011 IEEE.
Zhuang, J., Tsang, I.W. & Hoi, S.C.H. 2011, 'Two-layer multiple kernel learning', Journal of Machine Learning Research, pp. 909-917.
Multiple Kernel Learning (MKL) aims to learn kernel machines for solving a real machine learning problem (e.g. classification) by exploring the combinations of multiple kernels. The traditional MKL approach is in general "shallow" in the sense that the target kernel is simply a linear (or convex) combination of some base kernels. In this paper, we investigate a framework of Multi-Layer Multiple Kernel Learning (MLMKL) that aims to learn "deep" kernel machines by exploring the combinations of multiple kernels in a multi-layer structure, which goes beyond the conventional MKL approach. Through a multiple layer mapping, the proposed MLMKL framework offers higher flexibility than the regular MKL for finding the optimal kernel for applications. As the first attempt to this new MKL framework, we present a Two-Layer Multiple Kernel Learning (2LMKL) method together with two efficient algorithms for classification tasks. We analyze their generalization performances and have conducted an extensive set of experiments over 16 benchmark datasets, in which encouraging results showed that our method performed better than the conventional MKL methods. Copyright 2011 by the authors.
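The two-layer idea can be illustrated with kernel matrices directly: an inner (weighted) combination of base kernels is fed through an outer nonlinear kernel. A minimal sketch of one such composition; the weights mu would be learned in 2LMKL, and this particular RBF-style outer layer is only an assumed example:

    import numpy as np

    def two_layer_kernel(base_kernels, mu, sigma=1.0):
        """Compose base kernel matrices: outer RBF applied to their weighted sum."""
        K = sum(m * Kt for m, Kt in zip(mu, base_kernels))   # inner layer: linear MKL
        d = np.diag(K)
        dist = d[:, None] + d[None, :] - 2 * K               # distance induced by K
        return np.exp(-dist / (2 * sigma ** 2))              # outer layer: Gaussian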
Gao, S., Chia, L.T. & Tsang, I.W.H. 2011, 'Multi-layer group sparse coding - For concurrent image classification and annotation', Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2809-2816.
View/Download from: Publisher's site
We present a multi-layer group sparse coding framework for concurrent image classification and annotation. By leveraging the dependency between image class label and tags, we introduce a multi-layer group sparse structure of the reconstruction coefficients. Such structure fully encodes the mutual dependency between the class label, which describes the image content as a whole, and tags, which describe the components of the image content. Then we propose a multi-layer group based tag propagation method, which combines the class label and subgroups of instances with similar tag distribution to annotate test images. Moreover, we extend our multi-layer group sparse coding in the Reproducing Kernel Hilbert Space (RKHS) which captures the nonlinearity of features, and further improves performances of image classification and annotation. Experimental results on the LabelMe, UIUC-Sport and NUS-WIDE-Object databases show that our method outperforms the baseline methods, and achieves excellent performances in both image classification and annotation tasks. © 2011 IEEE.
Tan, M., Wang, L. & Tsang, I. 2010, 'Learning Sparse SVM for Feature Selection on Very High Dimensional Datasets', Proceedings of the 27th International Conference on Machine Learning, International Conference on Machine Learning, Omnipress, Haifa, Israel, pp. 1047-1054.
View/Download from: UTS OPUS
A sparse representation of Support Vector Machines (SVMs) with respect to input features is desirable for many applications. In this paper, by introducing a 0-1 control variable to each input feature, l0-norm Sparse SVM (SSVM) is converted to a mixed integer programming (MIP) problem. Rather than directly solving this MIP, we propose an efficient cutting plane algorithm combined with multiple kernel learning to solve its convex relaxation. A global convergence proof for our method is also presented. Comprehensive experimental results on one synthetic and 10 real-world datasets show that our proposed method can obtain better or competitive performance compared with existing SVM-based feature selection methods in terms of sparsity and generalization performance. Moreover, our proposed method can effectively handle large-scale and extremely high dimensional problems.
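A quick feel for sparse-SVM feature selection can be had with the l1-regularized linear SVM, one of the baselines the paper compares against (not the l0 cutting-plane method itself). A minimal sketch with scikit-learn on synthetic data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=200, n_features=1000,
                               n_informative=10, random_state=0)
    clf = LinearSVC(penalty='l1', dual=False, C=0.1, max_iter=5000).fit(X, y)
    selected = np.flatnonzero(clf.coef_)   # features given non-zero weight
    print(f'{selected.size} of {X.shape[1]} features selected')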
Gao, S., Tsang, I., Chia, L. & Zhao, P. 2010, 'Local Features Are Not Lonely - Laplacian Sparse Coding for Image Classification', 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, San Francisco, CA, pp. 3555-3561.
View/Download from: UTS OPUS
Sparse coding, which encodes the original signal in a sparse signal space, has shown state-of-the-art performance in the visual codebook generation and feature quantization process of BoW-based image representation. However, in the feature quantization process of sparse coding, some similar local features may be quantized into different visual words of the codebook due to the sensitivity of quantization. In this paper, to alleviate the impact of this problem, we propose a Laplacian sparse coding method, which exploits the dependence among the local features. Specifically, we propose to use a histogram-intersection-based kNN method to construct a Laplacian matrix, which can well characterize the similarity of local features. In addition, we incorporate this Laplacian matrix into the objective function of sparse coding to preserve the consistency of the sparse representations of similar local features. Comprehensive experimental results show that our method achieves or outperforms existing state-of-the-art results, and exhibits excellent performance on the Scene 15 data set.
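The Laplacian matrix described above is built from a histogram-intersection kNN graph over local features. A minimal sketch of that construction (the sparse-coding objective it regularizes is not shown; k is an assumed neighbourhood size):

    import numpy as np

    def hik_laplacian(H, k=5):
        """Graph Laplacian L = D - W from a histogram-intersection kNN graph.
        H: (n x d) array of non-negative histograms, one per local feature."""
        S = np.minimum(H[:, None, :], H[None, :, :]).sum(axis=-1)   # HIK similarity
        np.fill_diagonal(S, 0)
        W = np.zeros_like(S)
        for i, js in enumerate(np.argsort(-S, axis=1)[:, :k]):      # k nearest neighbours
            W[i, js] = S[i, js]
        W = np.maximum(W, W.T)                                      # symmetrise the graph
        return np.diag(W.sum(axis=1)) - W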
Duan, L., Xu, D., Tsang, I. & Luo, J. 2010, 'Visual Event Recognition in Videos by Learning from Web Data', 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, San Francisco, CA, pp. 1959-1966.
View/Download from: UTS OPUS
We propose a visual event recognition framework for consumer domain videos by leveraging a large amount of loosely labeled web videos (e.g., from YouTube). First, we propose a new aligned space-time pyramid matching method to measure the distances between two video clips, where each video clip is divided into space-time volumes over multiple levels. We calculate the pair-wise distances between any two volumes and further integrate the information from different volumes with Integer-flow Earth Mover's Distance (EMD) to explicitly align the volumes. Second, we propose a new cross-domain learning method in order to 1) fuse the information from multiple pyramid levels and features (i.e., space-time features and static SIFT features) and 2) cope with the considerable variation in feature distributions between videos from two domains (i.e., web domain and consumer domain). For each pyramid level and each type of local features, we train a set of SVM classifiers based on the combined training set from two domains using multiple base kernels of different kernel types and parameters, which are fused with equal weights to obtain an average classifier. Finally, we propose a cross-domain learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), to learn an adapted classifier based on multiple base kernels and the prelearned average classifiers by minimizing both the structural risk functional and the mismatch between data distributions from two domains. Extensive experiments demonstrate the effectiveness of our proposed framework, which requires only a small number of labeled consumer videos by leveraging web data.
Chen, L., Xu, D., Tsang, I. & Luo, J. 2010, 'Tag-based Web Photo Retrieval Improved by Batch Mode Re-Tagging', 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, San Francisco, CA, pp. 3440-3446.
View/Download from: UTS OPUS
Web photos in social media sharing websites such as Flickr are generally accompanied by rich but noisy textual descriptions (tags, captions, categories, etc.). In this paper, we propose a tag-based photo retrieval framework to improve the retrieval performance for Flickr photos by employing a novel batch mode re-tagging method. The proposed batch mode re-tagging method can automatically refine noisy tags of a group of Flickr photos uploaded by the same user within a short period by leveraging millions of training web images and their associated rich textual descriptions. Specifically, for one group of Flickr photos, we construct a group-specific lexicon which contains only the tags of all photos within the group. For each query tag, we employ the inverted file method to automatically find loosely labeled training web images. We propose an SVM with Augmented Features, referred to as AFSVM, to learn adapted classifiers to refine the annotation tags of photos by leveraging the existing SVM classifiers of popular tags, which are associated with a large amount of positive training web images. Moreover, to further refine the annotation tags of photos in the same group, we additionally introduce an objective function that utilizes the visual similarities of photos within the group as well as the semantic proximities of their tags. Based on the refined tags, photos can be retrieved according to more reliable relevance scores. Extensive experiments demonstrate the effectiveness of our framework.
Chen, B., Lam, W., Tsang, W. & Wong, T.K. 2010, 'Location and Scatter Matching for Dataset Shift in Text Mining', Proceedings of the IEEE International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE, Sydney, Australia, pp. 773-778.
View/Download from: UTS OPUS or Publisher's site
Dataset shift from the training data in a source domain to the data in a target domain poses a great challenge for many statistical learning methods. Most algorithms can be viewed as exploiting only the first-order statistics, namely, the empirical mean discrepancy, to evaluate the distribution gap. Intuitively, considering only the empirical mean may not be statistically efficient. In this paper, we propose a non-parametric distance metric that jointly considers the empirical mean (Location) and sample covariance (Scatter) differences. More specifically, we propose an improved symmetric Stein's loss function which combines the mean and covariance discrepancies into a unified Bregman matrix divergence, of which the Jensen-Shannon divergence between normal distributions is a particular case. Our target is to find a good feature representation which can reduce the distribution gap between different domains and, at the same time, ensure that the new derived representation can encode the most discriminative components with respect to the label information. We have conducted extensive experiments on several document classification datasets to demonstrate the effectiveness of our proposed method.
Tsang, W. & Mao, Q. 2010, 'Parameter-Free Spectral Kernel Learning', Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI2010), Conference on Uncertainty in Artificial Intelligence, AUAI Press, Catalina Island, California, pp. 350-357.
View/Download from: UTS OPUS
Due to the growing ubiquity of unlabeled data, learning with unlabeled data is attracting increasing attention in machine learning. In this paper, we propose a novel semi-supervised kernel learning method which can seamlessly combine the manifold structure of unlabeled data and Regularized Least-Squares (RLS) to learn a new kernel. Interestingly, the new kernel matrix can be obtained analytically with the use of the spectral decomposition of the graph Laplacian matrix. Hence, the proposed algorithm does not require any numerical optimization solvers. Moreover, by maximizing the kernel target alignment on labeled data, we can also learn the model parameters automatically with a closed-form solution. For a given graph Laplacian matrix, our proposed method does not need to tune any model parameter, including the tradeoff parameter in RLS and the balance parameter for unlabeled data. Extensive experiments on ten benchmark datasets show that our proposed two-stage parameter-free spectral kernel learning algorithm can obtain performance comparable with fine-tuned manifold regularization methods in the transductive setting, and outperform multiple kernel learning in the supervised setting.
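Since the construction is closed-form, the two-stage pipeline is compact enough to sketch. Below is a hedged Python illustration, assuming the smoothest eigenvectors of the graph Laplacian serve as rank-one base kernels whose weights are set by kernel-target alignment with the labels; the paper's exact spectral transform and normalization may differ.

```python
# A minimal sketch of two-stage spectral kernel learning (illustrative
# assumptions throughout; not the paper's exact formulation).
import numpy as np

def spectral_kernel(L, y_labeled, labeled_idx, n_eig=10):
    """Build a kernel from a graph Laplacian L and labels y in {-1, +1}."""
    vals, vecs = np.linalg.eigh(L)      # ascending: smoothest eigenvectors first
    U = vecs[:, :n_eig]
    weights = np.empty(n_eig)
    for i in range(n_eig):
        u = U[labeled_idx, i]
        # Alignment of the rank-one kernel u u^T with the target kernel y y^T.
        weights[i] = (u @ y_labeled) ** 2 / (u @ u + 1e-12)
    return (U * weights) @ U.T          # learned kernel over all points
```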
Gao, S., Tsang, W. & Chia, L.-.T. 2010, 'Kernel Sparse Representation for Image Classification and Face Recognition', Computer Vision - ECCV 2010, Lecture Notes in Computer Science, European Conference on Computer Vision, Springer, Crete, Greece, pp. 1-14.
View/Download from: UTS OPUS
Recent research has shown the effectiveness of using sparse coding (Sc) to solve many computer vision problems. Motivated by the fact that the kernel trick can capture the nonlinear similarity of features, which may reduce the feature quantization error and boost the sparse coding performance, we propose Kernel Sparse Representation (KSR). KSR is essentially the sparse coding technique in a high dimensional feature space mapped by an implicit mapping function. We apply KSR to both image classification and face recognition. By incorporating KSR into Spatial Pyramid Matching (SPM), we propose KSRSPM for image classification. KSRSPM can further reduce the information loss in the feature quantization step compared with Spatial Pyramid Matching using Sparse Coding (ScSPM). KSRSPM can be regarded as both a generalization of Efficient Match Kernel (EMK) and an extension of ScSPM. Compared with sparse coding, KSR can learn more discriminative sparse codes for face recognition. Extensive experimental results show that KSR outperforms sparse coding and EMK, and achieves state-of-the-art performance for image classification and face recognition on publicly available datasets.
Xu, X., Xu, D. & Tsang, I.W. 2010, 'Video concept detection using support vector machine with augmented features', Proceedings - 4th Pacific-Rim Symposium on Image and Video Technology, PSIVT 2010, pp. 381-385.
View/Download from: Publisher's site
In this paper, we present a direct application of the Support Vector Machine with Augmented Features (AFSVM) to video concept detection. For each visual concept, we learn an adapted classifier by leveraging the pre-learnt SVM classifiers of other concepts. The solution of AFSVM is to retrain the SVM classifier using augmented features, which concatenate the original feature vector with the decision value vector obtained from the pre-learnt SVM classifiers in the Reproducing Kernel Hilbert Space (RKHS). The experiments on the challenging TRECVID 2005 dataset demonstrate the effectiveness of AFSVM for video concept detection. © 2010 IEEE.
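The augmentation step itself is mechanical, so a small sketch conveys the idea. The Python snippet below is a hedged illustration using scikit-learn; the classifier choice, hyper-parameters, and names are assumptions, and the paper's AFSVM additionally adapts the classifier through its own regularizer rather than a plain retraining.

```python
# Hedged sketch of the augmented-feature idea: concatenate each example's
# original features with decision values of pre-learnt concept SVMs, then
# retrain a classifier on the augmented representation.
import numpy as np
from sklearn.svm import LinearSVC

def augment_features(X, prelearned_classifiers):
    """Append one decision-value column per pre-learnt classifier."""
    decision_values = np.column_stack(
        [clf.decision_function(X) for clf in prelearned_classifiers]
    )
    return np.hstack([X, decision_values])

# Usage, assuming X_train/y_train for the target concept and a fitted
# list `other_concept_svms` exist:
#   X_aug = augment_features(X_train, other_concept_svms)
#   adapted = LinearSVC(C=1.0).fit(X_aug, y_train)
#   scores = adapted.decision_function(augment_features(X_test, other_concept_svms))
```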
Gao, S., Wang, Z., Chia, L.T. & Tsang, I.W.H. 2010, 'Automatic image tagging via category label and web data', MM'10 - Proceedings of the ACM Multimedia 2010 International Conference, pp. 1115-1118.
View/Download from: Publisher's site
Image tagging is an important technique for image content understanding and text-based image processing. Given a selection of images, how to tag these images efficiently and effectively is an interesting problem. In this paper, a novel semi-automatic image tagging technique is proposed: by assigning each image a category label first, our method can automatically recommend promising tags to each image by utilizing existing vast web data. The main contributions of our paper can be highlighted as follows: (i) By assigning each image a category label, our method can automatically recommend other tags to the image, thus reducing the human annotation effort. Meanwhile, our method guarantees tag diversity thanks to abundant web data. (ii) We use sparse coding to automatically select semantically related images for tag propagation. (iii) Local and global ranking agglomeration makes our method robust to noisy tags. We use the Event dataset as the images to be tagged, and crawled Flickr images with their associated tags according to the category labels in the Event dataset as the auxiliary web data. Experimental results show that our method achieves promising performance for image tagging, which demonstrates its effectiveness. © 2010 ACM.
Seah, C.W., Tsang, I.W., Ong, Y.S. & Lee, K.K. 2010, 'Predictive distribution matching SVM for multi-domain learning', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 231-247.
View/Download from: Publisher's site
Domain adaptation (DA) using labeled data from related source domains comes in handy when the labeled patterns of a target domain are scarce. Nevertheless, it is worth noting that when the predictive distribution P(y|x) of the domains differs, which establishes Negative Transfer [19], DA approaches generally fail to perform well. Taking this cue, the Predictive Distribution Matching SVM (PDM-SVM) is proposed to learn a robust classifier in the target domain (referred to as the target classifier) by leveraging the labeled data from only the relevant regions of multiple sources. In particular, a k-nearest neighbor graph is iteratively constructed to identify the regions of relevant source labeled data where the predictive distribution maximally aligns with that of the target data. Predictive distribution matching regularization is then introduced to leverage these relevant source labeled data for training the target classifier. In addition, progressive transduction is adopted to infer the label of target unlabeled data for estimating the predictive distribution of the target domain. Finally, extensive experiments are conducted to illustrate the impact of Negative Transfer on several existing state-of-the-art DA methods, and demonstrate the improved performance efficacy of our proposed PDM-SVM on the commonly used multi-domain Sentiment and Reuters datasets. © 2010 Springer-Verlag Berlin Heidelberg.
Zhuang, J., Tsang, I. & Hoi, S. 2009, 'SimpleNPKL: Simple Non-Parametric Kernel Learning', Proceedings of the 26th International Conference on Machine Learning, International Conference on Machine Learning, Omnipress, Montreal, Quebec, pp. 1273-1280.
View/Download from: UTS OPUS
Previous studies of Non-Parametric Kernel (NPK) learning usually reduce to solving some Semi-Definite Programming (SDP) problem by a standard SDP solver. However, the time complexity of standard interior-point SDP solvers can be as high as O(n^6.5). Such an intensive computation cost makes NPK learning inapplicable to real applications, even for data sets of moderate size. In this paper, we propose an efficient approach to NPK learning from side information, referred to as SimpleNPKL, which can efficiently learn non-parametric kernels from large sets of pairwise constraints. In particular, we show that the proposed SimpleNPKL with linear loss has a closed-form solution that can be simply computed by the Lanczos algorithm. Moreover, we show that SimpleNPKL with square hinge loss can be re-formulated as a saddle-point optimization task, which can be further solved by a fast iterative algorithm. In contrast to the previous approaches, our empirical results show that our new technique achieves the same accuracy, but is significantly more efficient and scalable.
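To make the closed-form-via-Lanczos point concrete, here is a hedged Python sketch of that style of computation: a PSD kernel is assembled from the top eigenpairs of a matrix combining pairwise-constraint side information with a graph Laplacian, using SciPy's Lanczos-type eigensolver. The combination C = S - gamma*L, the trace budget B, and the rank are illustrative assumptions, not the paper's exact construction.

```python
# Illustrative low-rank non-parametric kernel from side information
# (assumed construction; SimpleNPKL's exact objective differs in detail).
import numpy as np
from scipy.sparse.linalg import eigsh  # Lanczos-type eigensolver

def npk_from_side_information(L, S, gamma=1.0, B=10.0, rank=5):
    """L: graph Laplacian; S: +1/-1 must-link/cannot-link constraint matrix."""
    C = S - gamma * L                          # side information vs. smoothness
    vals, vecs = eigsh(C, k=rank, which='LA')  # top eigenpairs via Lanczos
    vals = np.maximum(vals, 0.0)               # keep the PSD part
    if vals.sum() > 0:
        vals *= B / vals.sum()                 # spend the trace budget B
    return (vecs * vals) @ vecs.T              # low-rank PSD kernel matrix
```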
Pan, S.J., Tsang, I., Kwok, J.T. & Yang, Q. 2009, 'Domain Adaptation via Transfer Component Analysis', International Joint Conference on Artificial Intelligence, AAAI, Pasadena, CA, USA.
View/Download from: UTS OPUS
Nie, F., Xu, D., Tsang, I. & Zhang, C. 2009, 'Spectral Embedded Clustering', International Joint Conference on Artificial Intelligence, AAAI, Pasadena, CA, USA.
View/Download from: UTS OPUS
Duan, L., Tsang, I., Xu, D. & Chua, T. 2009, 'Domain Adaptation from Multiple Sources via Auxiliary Classifiers', Proceedings of the 26th International Conference on Machine Learning, International Conference on Machine Learning, Omnipress, Montreal, Quebec, pp. 289-296.
View/Download from: UTS OPUS
We propose a multiple source domain adaptation method, referred to as Domain Adaptation Machine (DAM), to learn a robust decision function (referred to as the target classifier) for label prediction of patterns from the target domain by leveraging a set of pre-computed classifiers (referred to as auxiliary/source classifiers) independently learned with the labeled patterns from multiple source domains. We introduce a new data-dependent regularizer based on a smoothness assumption into Least-Squares SVM (LS-SVM), which enforces that the target classifier shares similar decision values with the auxiliary classifiers from relevant source domains on the unlabeled patterns of the target domain. In addition, we employ a sparsity regularizer to learn a sparse target classifier. Comprehensive experiments on the challenging TRECVID 2005 corpus demonstrate that DAM outperforms the existing multiple source domain adaptation methods for video concept detection in terms of effectiveness and efficiency.
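In the linear LS-SVM case, the data-dependent regularizer amounts to pulling the target classifier's outputs on unlabeled target data toward a relevance-weighted combination of the source classifiers' outputs, which collapses to one regularized least-squares solve. The Python sketch below shows that reduced form; the relevance weights and trade-off parameters are assumptions, and the paper's sparsity regularizer is omitted for brevity.

```python
# Minimal linear LS-SVM-style rendering of DAM's smoothness idea
# (simplified; not the full DAM with its sparsity regularizer).
import numpy as np

def dam_sketch(Xl, yl, Xu, source_clfs, weights, gamma=1.0, lam=1e-2):
    """Fit w so that Xl @ w ~ yl and Xu @ w ~ weighted source decision values."""
    # Pseudo-targets on unlabeled target data: relevance-weighted source outputs.
    f_src = sum(w_k * clf.decision_function(Xu)
                for w_k, clf in zip(weights, source_clfs))
    # Stack the labeled loss and the data-dependent regularizer into one fit:
    # minimize ||Xl w - yl||^2 + gamma ||Xu w - f_src||^2 + lam ||w||^2.
    A = np.vstack([Xl, np.sqrt(gamma) * Xu])
    b = np.concatenate([yl, np.sqrt(gamma) * f_src])
    d = Xl.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
```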
Chen, B., Lam, W., Tsang, W. & Wong, T.-.L. 2009, 'Extracting Discriminative Concepts for Domain Adaptation in Text Mining', ACM SIGKDD conference on Knowledge Discovery and Data Mining, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Paris, France, pp. 179-188.
View/Download from: UTS OPUS
One common predictive modeling challenge in text mining problems is that the training data and the operational (testing) data are drawn from different underlying distributions. This poses a great difficulty for many statistical learning methods. However, when the distributions in the source domain and the target domain are not identical but related, there may exist a shared concept space that preserves the relation. Consequently, a good feature representation can encode this concept space and minimize the distribution gap. To formalize this intuition, we propose a domain adaptation method that parameterizes this concept space by a linear transformation under which we explicitly minimize the distribution difference between the source domain with sufficient labeled data and target domains with only unlabeled data, while at the same time minimizing the empirical loss on the labeled data in the source domain. Another characteristic of our method is its capability for considering multiple classes and their interactions simultaneously. We have conducted extensive experiments on two common text mining problems, namely, information extraction and document classification, to demonstrate the effectiveness of our proposed method.
Li, Y.-.F., Kwok, J., Tsang, W. & Zhou, Z.-.H. 2009, 'A Convex Method for Locating Regions of Interest with Multi-Instance Learning', European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (LNCS 5782) Part II, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Springer, Bled, Slovenia, pp. 15-30.
View/Download from: UTS OPUS
In content-based image retrieval (CBIR) and image screening, it is often desirable to locate the regions of interest (ROI) in the images automatically. This can be accomplished with multi-instance learning techniques by treating each image as a bag of instances (regions). Many SVM-based methods are successful in predicting the bag labels; however, few of them can locate the ROIs. Moreover, they are often based on either local search or an EM-style strategy, and may get stuck in local minima easily. In this paper, we propose two convex optimization methods which maximize the margin of concepts via key instance generation at the instance level and bag level, respectively. Our formulation can be solved efficiently with a cutting plane algorithm. Experiments show that the proposed methods can effectively locate ROIs, and they also achieve performances competitive with state-of-the-art algorithms on benchmark data sets.
Chen, B., Lam, W., Tsang, W. & Wong, T.-.L. 2009, 'A Semi-Supervised Framework for Feature Mapping and Multiclass Classification', Proceedings of SIAM International Conference on Data Mining, SIAM International Conference on Data Mining, SIAM, Sparks, Nevada.
View/Download from: UTS OPUS
We propose a semi-supervised framework incorporating feature mapping with multiclass classification. By learning multiple classification tasks simultaneously, this framework can learn the latent feature space effectively for both labeled and unlabeled data. The knowledge in the transformed space can be transferred not only between the labeled and unlabeled data, but also across multiple classes, so as to improve the classification performance given a small amount of labeled data. We show that this problem is equivalent to a sequential convex optimization problem by applying the constrained concave-convex procedure (CCCP). An efficient algorithm with a theoretical guarantee is proposed, and computational issues are investigated. Extensive experiments have been conducted to demonstrate the effectiveness of our proposed framework.
Li, Y.F., Tsang, I.W., Kwok, J.T. & Zhou, Z.H. 2009, 'Tighter and convex maximum margin clustering', Journal of Machine Learning Research, pp. 344-351.
The maximum margin principle has been successfully applied to many supervised and semi-supervised problems in machine learning. Recently, this principle was extended to clustering, referred to as Maximum Margin Clustering (MMC), and achieved promising performance in recent studies. To avoid the problem of local minima, MMC can be solved globally via convex semi-definite programming (SDP) relaxation. Although many efficient approaches have been proposed to alleviate the computational burden of SDP, convex MMCs are still not scalable for medium data sets. In this paper, we propose a novel convex optimization method, LG-MMC, which maximizes the margin of opposite clusters via "Label Generation". It can be shown that LG-MMC is much more scalable than existing convex approaches. Moreover, we show that our convex relaxation is tighter than state-of-the-art convex MMCs. Experiments on seventeen UCI datasets and the MNIST dataset show significant improvement over existing MMC algorithms. © 2009 by the authors.
Liu, Y., Xu, D., Tsang, I.W. & Luo, J. 2009, 'Using large-scale web data to facilitate textual query based retrieval of consumer photos', MM'09 - Proceedings of the 2009 ACM Multimedia Conference, with Co-located Workshops and Symposiums, pp. 55-64.
View/Download from: Publisher's site
The rapid popularization of digital cameras and mobile phone cameras has led to an explosive growth of consumer photo collections. In this paper, we present a (quasi) real-time textual query based personal photo retrieval system by leveraging millions of web images and their associated rich textual descriptions (captions, categories, etc.). After a user provides a textual query (e.g., "pool"), our system exploits the inverted file method to automatically find the positive web images that are related to the textual query "pool" as well as the negative web images which are irrelevant to the textual query. Based on these automatically retrieved relevant and irrelevant web images, we employ two simple but effective classification methods, k Nearest Neighbor (kNN) and decision stumps, to rank personal consumer photos. To further improve the photo retrieval performance, we propose three new relevance feedback methods via cross-domain learning. These methods effectively utilize both the web images and the consumer images. In particular, our proposed cross-domain learning methods can learn robust classifiers with only a very limited amount of labeled consumer photos from the user by leveraging the pre-learned decision stumps at interactive response time. Extensive experiments on both consumer and professional stock photo datasets demonstrate the effectiveness and efficiency of our system, which is also inherently not limited by any predefined lexicon. Copyright 2009 ACM.
Liu, Y., Xu, D., Tsang, I.W. & Luo, J. 2009, 'T-IRS: Textual query based image retrieval system for consumer photos', MM'09 - Proceedings of the 2009 ACM Multimedia Conference, with Co-located Workshops and Symposiums, pp. 983-984.
View/Download from: Publisher's site
In this demonstration, we present a (quasi) real-time textual query based image retrieval system (T-IRS) for consumer photos by leveraging millions of web images and their associated rich textual descriptions (captions, categories, etc.). After a user provides a textual query (e.g., "boat"), our system automatically finds the positive web images that are related to the textual query "boat" as well as the negative web images which are irrelevant to the textual query. Based on these automatically retrieved positive and negative web images, we employ the decision stump ensemble classifier to rank personal consumer photos. To further improve the photo retrieval performance, we also develop a novel relevance feedback method, referred to as Cross-Domain Regularized Regression (CDRR), which effectively utilizes both the web images and the consumer images. Our system is inherently not limited by any predefined lexicon.
Duan, L., Tsang, I.W., Xu, D. & Maybank, S.J. 2009, 'Domain transfer SVM for video concept detection', 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, pp. 1375-1381.
View/Download from: Publisher's site
Cross-domain learning methods have shown promising results by leveraging labeled patterns from auxiliary domains to learn a robust classifier for a target domain which has a limited number of labeled samples. To cope with the tremendous change of feature distribution between different domains in video concept detection, we propose a new cross-domain kernel learning method. Our method, referred to as Domain Transfer SVM (DTSVM), simultaneously learns a kernel function and a robust SVM classifier by minimizing both the structural risk functional of the SVM and the distribution mismatch of labeled and unlabeled samples between the auxiliary and target domains. Comprehensive experiments on the challenging TRECVID corpus demonstrate that DTSVM outperforms existing cross-domain learning and multiple kernel learning methods. © 2009 IEEE.
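The mismatch term in methods of this family is commonly measured by the Maximum Mean Discrepancy (MMD) between the two domains in the kernel-induced feature space. As a hedged point of reference, the snippet below computes the biased empirical squared MMD under an RBF kernel; the kernel and bandwidth are assumptions, and DTSVM itself goes further by learning the kernel that trades this quantity off against the SVM risk.

```python
# Biased empirical squared MMD between a source and a target sample
# (standard estimator; bandwidth gamma is an illustrative choice).
from sklearn.metrics.pairwise import rbf_kernel

def mmd2(Xs, Xt, gamma=1.0):
    Kss = rbf_kernel(Xs, Xs, gamma=gamma)   # source-source kernel block
    Ktt = rbf_kernel(Xt, Xt, gamma=gamma)   # target-target kernel block
    Kst = rbf_kernel(Xs, Xt, gamma=gamma)   # cross-domain kernel block
    return Kss.mean() + Ktt.mean() - 2.0 * Kst.mean()
```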
Zhang, K., Tsang, I. & Kwok, J. 2008, 'Improved Nyström low-rank approximation and error analysis', Proceedings of the 25th International Conference on Machine Learning, International Conference on Machine Learning, Omnipress, Helsinki, Finland, pp. 1232-1239.
View/Download from: UTS OPUS
Low-rank matrix approximation is an effective tool in alleviating the memory and computational burdens of kernel methods, and sampling, as the mainstream of such algorithms, has drawn considerable attention in both theory and practice. This paper presents detailed studies on the Nyström sampling scheme and, in particular, an error analysis that directly relates the Nyström approximation quality with the encoding power of the landmark points in summarizing the data. The resultant error bound suggests a simple and efficient sampling scheme, the k-means clustering algorithm, for Nyström low-rank approximation. We compare it with state-of-the-art approaches that range from greedy schemes to probabilistic sampling. Our algorithm achieves significant performance gains in a number of supervised/unsupervised learning tasks including kernel PCA and least squares SVM.
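The suggested scheme is short enough to sketch end to end: choose k-means centers as landmarks, then form the standard Nyström reconstruction from the cross- and landmark-kernel blocks. The Python snippet below is a hedged illustration; the RBF kernel, the number of landmarks, and the pseudo-inverse jitter are assumptions rather than the paper's experimental setup.

```python
# k-means-landmark Nystrom approximation: K ~= C @ pinv(W) @ C.T
# (illustrative parameter choices throughout).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def nystrom_kmeans(X, n_landmarks=50, gamma=1.0, jitter=1e-8):
    landmarks = KMeans(n_clusters=n_landmarks, n_init=10).fit(X).cluster_centers_
    C = rbf_kernel(X, landmarks, gamma=gamma)          # n x m cross-kernel block
    W = rbf_kernel(landmarks, landmarks, gamma=gamma)  # m x m landmark kernel
    W += jitter * np.eye(n_landmarks)                  # stabilize the inverse
    return C @ np.linalg.pinv(W) @ C.T                 # low-rank approximation
```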
Tsang, I.W. & Kwok, J.T. 2007, 'Large-scale sparsified manifold regularization', Advances in Neural Information Processing Systems, pp. 1401-1408.
Semi-supervised learning is more powerful than supervised learning by using both labeled and unlabeled data. In particular, the manifold regularization framework, together with kernel methods, leads to the Laplacian SVM (LapSVM) that has demonstrated state-of-the-art performance. However, the LapSVM solution typically involves kernel expansions of all the labeled and unlabeled examples, and is slow on testing. Moreover, existing semi-supervised learning methods, including the LapSVM, can only handle a small number of unlabeled examples. In this paper, we integrate manifold regularization with the core vector machine, which has been used for large-scale supervised and unsupervised learning. By using a sparsified manifold regularizer and formulating as a center-constrained minimum enclosing ball problem, the proposed method produces sparse solutions with low time and space complexities. Experimental results show that it is much faster than the LapSVM, and can handle a million unlabeled examples on a standard PC; while the LapSVM can only handle several thousand patterns.
Tsang, I.W. & Kwok, J.T. 2007, 'Ensembles of partially trained SVMs with multiplicative updates', IJCAI International Joint Conference on Artificial Intelligence, pp. 1089-1094.
The training of support vector machines (SVM) involves a quadratic programming problem, which is often optimized by a complicated numerical solver. In this paper, we propose a much simpler approach based on multiplicative updates. This idea was first explored in [Cristianini et al., 1999], but its convergence is sensitive to a learning rate that has to be fixed manually. Moreover, the update rule only works for the hard-margin SVM, which is known to have poor performance on noisy data. In this paper, we show that the multiplicative update of SVM can be formulated as a Bregman projection problem, and the learning rate can then be adapted automatically. Moreover, because of the connection between boosting and Bregman distance, we show that this multiplicative update for SVM can be regarded as boosting the (weighted) Parzen window classifiers. Motivated by the success of boosting, we then consider the use of an adaptive ensemble of the partially trained SVMs. Extensive experiments show that the proposed multiplicative update rule with an adaptive learning rate leads to faster and more stable convergence. Moreover, the proposed ensemble has efficient training and comparable or even better accuracy than the best-tuned soft-margin SVM.
Zhang, K., Tsang, I.W. & Kwok, J.T. 2007, 'Maximum margin clustering made practical', ACM International Conference Proceeding Series, pp. 1119-1126.
View/Download from: Publisher's site
Maximum margin clustering (MMC) is a recent large margin unsupervised learning approach that has often outperformed conventional clustering methods. Computationally, it involves non-convex optimization and has to be relaxed to different semidefinite programs (SDP). However, SDP solvers are computationally very expensive and only small data sets can be handled by MMC so far. To make MMC more practical, we avoid SDP relaxations and propose in this paper an efficient approach that performs alternating optimization directly on the original non-convex problem. A key step to avoid premature convergence is on the use of SVR with the Laplacian loss, instead of SVM with the hinge loss, in the inner optimization subproblem. Experiments on a number of synthetic and real-world data sets demonstrate that the proposed approach is often more accurate, much faster and can handle much larger data sets.
Tsang, I.W., Kocsor, A. & Kwok, J.T. 2007, 'Simpler core vector machines with enclosing balls', ACM International Conference Proceeding Series, pp. 911-918.
View/Download from: Publisher's site
The core vector machine (CVM) is a recent approach for scaling up kernel methods based on the notion of minimum enclosing ball (MEB). Though conceptually simple, an efficient implementation still requires a sophisticated numerical solver. In this paper, we introduce the enclosing ball (EB) problem where the ball's radius is fixed and thus does not have to be minimized. We develop efficient (1 + ε)-approximation algorithms that are simple to implement and do not require any numerical solver. For the Gaussian kernel in particular, a suitable choice of this (fixed) radius is easy to determine, and the center obtained from the (1 + ε)-approximation of this EB problem is close to the center of the corresponding MEB. Experimental results show that the proposed algorithm has accuracies comparable to the other large-scale SVM implementations, but can handle very large data sets and is even faster than the CVM in general.
Tsang, I.W., Kocsor, A. & Kwok, J.T. 2006, 'Diversified SVM ensembles for large data sets', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 792-800.
Recently, the core vector machine (CVM) has shown significant speedups on classification and regression problems with massive data sets. Its performance is also almost as accurate as other state-of-the-art SVM implementations. Incorporating orthogonality constraints to diversify the CVM ensembles turns out to speed up the maximum margin discriminant analysis (MMDA) algorithm. Extensive comparisons with the MMDA ensemble along with bagging on a number of large data sets show that the proposed diversified CVM ensemble can improve classification performance, and is also faster than the original MMDA algorithm by more than an order of magnitude. © Springer-Verlag Berlin Heidelberg 2006.
Tsang, I.W., Kocsor, A. & Kwok, J.T. 2006, 'Efficient kernel feature extraction for massive data sets', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 724-729.
Maximum margin discriminant analysis (MMDA) uses the margin idea for feature extraction, and often outperforms traditional methods like kernel principal component analysis (KPCA) and kernel Fisher discriminant analysis (KFD). However, as in other kernel methods, its time complexity is cubic in the number of training points m, and it is thus computationally inefficient on massive data sets. In this paper, we propose a (1 + ε)²-approximation algorithm for obtaining the MMDA features by extending the core vector machines. The resultant time complexity is only linear in m, while its space complexity is independent of m. Extensive comparisons with the original MMDA, KPCA, and KFD on a number of large data sets show that the proposed feature extractor can improve classification accuracy, and is also faster than these kernel-based methods by more than an order of magnitude. Copyright 2006 ACM.
Tsang, I.W., Kwok, J.T. & Li, S. 2006, 'Learning the kernel in mahalanobis one-class support vector machines', IEEE International Conference on Neural Networks - Conference Proceedings, pp. 1169-1175.
In this paper, we show that one-class SVMs can also utilize data covariance in a robust manner to improve performance. Furthermore, by constraining the desired kernel function as a convex combination of base kernels, we show that the weighting coefficients can be learned via quadratically constrained quadratic programming (QCQP) or second order cone programming (SOCP) methods. Performance on both toy and real-world data sets show promising results. This paper thus offers another demonstration of the synergy between convex optimization and kernel methods. © 2006 IEEE.
Tsang, I.W., Kwok, J.T., Mak, B., Zhang, K. & Pan, J.J. 2006, 'Fast speaker adaption via maximum penalized likelihood kernel regression', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. I997-I1000.
Maximum likelihood linear regression (MLLR) has been a popular speaker adaptation method for many years. In this paper, we investigate a generalization of MLLR using non-linear regression. Specifically, kernel regression is applied with appropriate regularization to determine the transformation matrix in MLLR for fast speaker adaptation. The proposed method, called maximum penalized likelihood kernel regression adaptation (MPLKR), is computationally simple and the mean vectors of the speaker adapted acoustic model can be obtained analytically by simply solving a linear system. Since no nonlinear optimization is involved, the obtained solution is always guaranteed to be globally optimal. The new adaptation method was evaluated on the Resource Management task with 5s and 10s of adaptation speech. Results show that MPLKR outperforms the standard MLLR method. © 2006 IEEE.
Tsang, I.W., Kwok, J.T. & Cheung, P.M. 2005, 'Very large SVM training using Core Vector Machines', AISTATS 2005 - Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pp. 349-356.
Standard SVM training has O(m^3) time and O(m^2) space complexities, where m is the training set size. In this paper, we scale up kernel methods by exploiting the "approximateness" in practical SVM implementations. We formulate many kernel methods as equivalent minimum enclosing ball problems in computational geometry, and then obtain provably approximately optimal solutions efficiently with the use of core-sets. Our proposed Core Vector Machine (CVM) algorithm has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and real-world data sets demonstrate that the CVM is much faster and can handle much larger data sets than existing scale-up methods. In particular, on our PC with only 512M RAM, the CVM with Gaussian kernel can process the checkerboard data set with 1 million points in less than 13 seconds.
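The core-set machinery underlying the CVM is easiest to see in input space. The hedged Python sketch below implements the classical Badoiu-Clarkson iteration for a (1 + ε)-approximate minimum enclosing ball; the CVM itself runs this style of iteration in the kernel-induced feature space with a small QP over the core-set, which this didactic version omits.

```python
# Badoiu-Clarkson core-set iteration for an approximate minimum enclosing
# ball in input space (didactic sketch, not the kernelized CVM solver).
import numpy as np

def approx_meb(X, eps=0.1):
    """Return an approximate MEB center and radius for the rows of X."""
    c = X[0].astype(float).copy()
    n_iter = int(np.ceil(1.0 / eps ** 2))
    for t in range(1, n_iter + 1):
        # The farthest point from the current center joins the core-set.
        far = X[np.argmax(np.linalg.norm(X - c, axis=1))]
        c += (far - c) / (t + 1)          # step toward the farthest point
    radius = np.linalg.norm(X - c, axis=1).max()
    return c, radius
```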
Tsang, I.W., Cheung, P.M. & Kwok, J.T. 2005, 'Kernel relevant component analysis for distance metric learning', Proceedings of the International Joint Conference on Neural Networks, pp. 954-959.
View/Download from: Publisher's site
Defining a good distance measure between patterns is of crucial importance in many classification and clustering algorithms. Recently, relevant component analysis (RCA) was proposed, which offers a simple yet powerful method to learn this distance metric. However, it is confined to linear transforms in the input space. In this paper, we show that RCA can also be kernelized, which then results in significant improvements when nonlinearities are needed. Moreover, it becomes applicable to distance metric learning for structured objects that have no natural vectorial representation. Besides, it can be used in an incremental setting. Performance of this kernel method is evaluated on both toy and real-world data sets with encouraging results. © 2005 IEEE.
Tsang, I.W., Kwok, J.T. & Lai, K.T. 2005, 'Core vector regression for very large regression problems', ICML 2005 - Proceedings of the 22nd International Conference on Machine Learning, pp. 913-920.
In this paper, we extend the recently proposed Core Vector Machine algorithm to the regression setting by generalizing the underlying minimum enclosing ball problem. The resultant Core Vector Regression (CVR) algorithm can be used with any linear/nonlinear kernels and can obtain provably approximately optimal solutions. Its asymptotic time complexity is linear in the number of training patterns m, while its space complexity is independent of m. Experiments show that CVR has comparable performance with SVR, but is much faster and produces much fewer support vectors on very large data sets. It is also successfully applied to large 3D point sets in computer graphics for the modeling of implicit surfaces.
Wong, K.F.S., Tsang, I.W., Cheung, V., Chan, S.H.G. & Kwok, J.T. 2005, 'Position estimation for wireless sensor networks', GLOBECOM - IEEE Global Telecommunications Conference, pp. 2772-2776.
View/Download from: Publisher's site
In wireless sensor networks, estimating nodal positions is important for routing efficiency and location-based services. Traditional techniques based on precise measurements are often expensive and power-inefficient, while approaches based on landmarks often require bandwidth-inefficient flooding and hence are not scalable for large networks. In this paper, we propose and investigate a cost-effective and distributed algorithm to accurately estimate nodal positions for wireless sensor networks. In our algorithm, a node only needs to identify and exchange information with a certain number of neighbors (around 30) in its proximity in order to estimate its relative nodal position accurately. For location-identification, only a small number of nodes (around 10) are needed to have additional GPS capabilities to accurately estimate the absolute position of every node in the network. Our algorithm is shown to have fast convergence with low estimation error, even for large networks. © 2005 IEEE.
Tsang, I.W. & Kwok, J.T. 2004, 'Efficient hyperkernel learning using second-order cone programming', Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science), pp. 453-464.
The kernel function plays a central role in kernel methods. Most existing methods can only adapt the kernel parameters or the kernel matrix based on empirical data. Recently, Ong et al. introduced the method of hyperkernels which can be used to learn the kernel function directly in an inductive setting [12]. However, the associated optimization problem is a semidefinite program (SDP), which is very computationally expensive even with the recent advances in interior point methods. In this paper, we show that this learning problem can be equivalently reformulated as a second-order cone program (SOCP), which can then be solved more efficiently than SDPs. Experimental results on both toy and real-world data sets show significant speedup. Moreover, in comparison with the kernel matrix learning method proposed by Lanckriet et al. [7], our proposed SOCP-based hyperkernel method yields better generalization performance, with a speed that is comparable to their formulation based on quadratically constrained quadratic programming (QCQP). © Springer-Verlag Berlin Heidelberg 2004.
Chu, C.S., Tsang, I.W. & Kwok, J.T. 2004, 'Scaling up support vector data description by using core-sets', IEEE International Conference on Neural Networks - Conference Proceedings, pp. 425-430.
Support vector data description (SVDD) is a powerful kernel method that has been commonly used for novelty detection. While its quadratic programming formulation has the important computational advantage of avoiding the problem of local minima, it has a runtime complexity of O(N^3), where N is the number of training patterns. It thus becomes prohibitive when the data set is large. Inspired by the use of core-sets in approximating the minimum enclosing ball problem in computational geometry, we propose in this paper an approximation method that allows SVDD to scale better to larger data sets. Most importantly, the proposed method has a running time that is only linear in N. Experimental results on two large real-world data sets demonstrate that the proposed method can handle data sets that are much larger than those that can be handled by standard SVDD packages, while its approximate solution still attains equally good, or sometimes even better, novelty detection performance.
Kwok, J.T. & Tsang, I.W. 2003, 'Learning with Idealized Kernels', Proceedings, Twentieth International Conference on Machine Learning, pp. 400-407.
The kernel function plays a central role in kernel methods. Existing methods typically fix the functional form of the kernel in advance and then only adapt the associated kernel parameters based on empirical data. In this paper, we consider the problem of adapting the kernel so that it becomes more similar to the so-called ideal kernel. We formulate this as a distance metric learning problem that searches for a suitable linear transform (feature weighting) in the kernel-induced feature space. This formulation is applicable even when the training set can only provide examples of similar and dissimilar pairs, but not explicit class label information. Computationally, this leads to a local-optima-free quadratic programming problem, with the number of variables independent of the number of features. Performance of this method is evaluated on classification and clustering tasks on both toy and real-world data sets.
Kwok, J.T. & Tsang, I.W. 2003, 'The Pre-Image Problem in Kernel Methods', Proceedings, Twentieth International Conference on Machine Learning, pp. 408-415.
In this paper, we address the problem of finding the pre-image of a feature vector in the feature space induced by a kernel. This is of central importance in some kernel applications, such as using kernel principal component analysis (PCA) for image denoising. Unlike the traditional method in (Mika et al., 1998), which relies on nonlinear optimization, our proposed method directly finds the location of the pre-image based on distance constraints in the feature space. It is non-iterative, involves only linear algebra and does not suffer from numerical instability or local minimum problems. Performance of this method is evaluated on performing kernel PCA and kernel clustering on the USPS data set.

Journal articles

Tan, M., Tsang, I.W. & Wang, L. 2015, 'Matching pursuit LASSO part I: Sparse recovery over big dictionary', IEEE Transactions on Signal Processing, vol. 63, no. 3, pp. 727-741.
View/Download from: Publisher's site
© 2015 IEEE. Large-scale sparse recovery (SR) by solving ℓ1-norm relaxations over a Big Dictionary is a very challenging task. Plenty of greedy methods have therefore been proposed to address big SR problems, but most of them require restricted conditions for convergence. Moreover, it is non-trivial for them to incorporate the ℓ1-norm regularization that is required for robust signal recovery. We address these issues in this paper by proposing a Matching Pursuit LASSO (MPL) algorithm, based on a novel quadratically constrained linear program (QCLP) formulation, which has several advantages over existing methods. Firstly, it is guaranteed to converge to a global solution. Secondly, it greatly reduces the computation cost of the ℓ1-norm methods over Big Dictionaries. Lastly, the exact sparse recovery condition of MPL is also investigated.
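For readers unfamiliar with matching-pursuit-style recovery, the hedged Python sketch below is plain Orthogonal Matching Pursuit (OMP), included purely as a point of reference; it is not the MPL algorithm itself, which selects batches of atoms through the QCLP formulation and retains the ℓ1 regularization.

```python
# Plain OMP for reference (not MPL): greedily pick the atom most correlated
# with the residual, then refit on the selected support by least squares.
import numpy as np

def omp(D, y, n_nonzero):
    """Recover a sparse x with D @ x ~ y, selecting n_nonzero atoms."""
    residual, support = y.copy(), []
    x = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x
```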
Tan, M., Tsang, I.W. & Wang, L. 2015, 'Matching pursuit LASSO part II: Applications and sparse recovery over batch signals', IEEE Transactions on Signal Processing, vol. 63, no. 3, pp. 742-753.
View/Download from: Publisher's site
© 2015 IEEE. In Part I, a Matching Pursuit LASSO (MPL) algorithm was presented for solving large-scale sparse recovery (SR) problems. In this paper, we present a subspace search to further improve the performance of MPL, and then continue to address another major challenge of SR: batch SR with many signals, a consideration which is absent from most previous ℓ1-norm methods. A batch-mode MPL (BMPL) is developed to vastly speed up sparse recovery of many signals simultaneously. Comprehensive numerical experiments on compressive sensing and face recognition tasks demonstrate the superior performance of MPL and BMPL over the other methods considered in this paper, in terms of sparse recovery ability and efficiency. In particular, BMPL is up to 400 times faster than existing ℓ1-norm methods considered to be state-of-the-art.
Chen, M., Tsang, I.W., Tan, M. & Cham, T.J. 2015, 'A unified feature selection framework for graph embedding on high dimensional data', IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 6, pp. 1465-1477.
View/Download from: Publisher's site
© 2014 IEEE. Although graph embedding has been a powerful tool for modeling data intrinsic structures, simply employing all features for data structure discovery may result in noise amplification. This is particularly severe for high dimensional data with small samples. To meet this challenge, this paper proposes a novel efficient framework to perform feature selection for graph embedding, in which a category of graph embedding methods is cast as a least squares regression problem. In this framework, a binary feature selector is introduced to naturally handle the feature cardinality in the least squares formulation. The resultant integer programming problem is then relaxed into a convex Quadratically Constrained Quadratic Program (QCQP) learning problem, which can be efficiently solved via a sequence of accelerated proximal gradient (APG) methods. Since each APG optimization is w.r.t. only a subset of features, the proposed method is fast and memory efficient. The proposed framework is applied to several graph embedding learning problems, including supervised, unsupervised, and semi-supervised graph embedding. Experimental results on several high dimensional datasets demonstrate that the proposed method outperforms the considered state-of-the-art methods.
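APG is a generic workhorse, so a small self-contained example conveys the flavour of the solver. The hedged sketch below is textbook FISTA applied to an ℓ1-regularized least-squares problem; it stands in for the paper's QCQP subproblems, and the step size and regularization weight are illustrative assumptions.

```python
# Textbook FISTA (accelerated proximal gradient) on an l1 least-squares
# problem, as a stand-in for the paper's APG subproblems.
import numpy as np

def fista_l1(A, b, lam=0.1, n_iter=200):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, ord=2) ** 2      # Lipschitz constant of the gradient
    x = z = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(n_iter):
        g = z - A.T @ (A @ z - b) / L      # gradient step at the momentum point
        x_new = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-threshold
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)              # momentum
        x, t = x_new, t_new
    return x
```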
Feng, L., Ong, Y.S., Tan, A.H. & Tsang, I.W. 2015, 'Memes as building blocks: a case study on evolutionary optimization + transfer learning for routing problems', Memetic Computing.
View/Download from: Publisher's site
© 2015 Springer-Verlag Berlin Heidelberg. A significantly under-explored area of evolutionary optimization in the literature is the study of optimization methodologies that can evolve along with the problems solved. In particular, present evolutionary optimization approaches generally start their search from scratch or the ground-zero state of knowledge, independent of how similar the given new problem of interest is to those optimized previously. There has thus been an apparent lack of automated knowledge transfer and reuse across problems. Taking this cue, this paper presents a Memetic Computational Paradigm based on Evolutionary Optimization + Transfer Learning for search, one that models how humans solve problems, and embarks on a study towards intelligent evolutionary optimization of problems through the transfer of structured knowledge, in the form of memes as building blocks learned from previous problem-solving experiences, to enhance future evolutionary searches. The proposed approach is composed of four culture-inspired operators, namely, Learning, Selection, Variation and Imitation. The role of the learning operator is to mine for latent knowledge buried in past experiences of problem-solving. The learning task is modelled as a mapping between past problem instances solved and the respective optimized solutions by maximizing their statistical dependence. The selection operator serves to identify the high quality knowledge that shall replicate and transmit to future searches, while the variation operator injects new innovations into the learned knowledge. The imitation operator, on the other hand, models the assimilation of innovated knowledge into the search. Studies on two separate established NP-hard problem domains and a realistic package collection/delivery problem are conducted to assess and validate the benefits of the proposed new memetic computation paradigm.
Feng, L., Ong, Y.-.S., Lim, M.-.H. & Tsang, I.W. 2015, 'Memetic Search With Interdomain Learning: A Realization Between CVRP and CARP', IEEE Transactions on Evolutionary Computation, vol. 19, no. 5, pp. 644-658.
View/Download from: Publisher's site
Tan, M., Tsang, I. & Wang, L. 2014, 'Towards Ultrahigh Dimensional Feature Selection for Big Data', Journal of Machine Learning Research, vol. 15, pp. 1371-1429.
Li, W., Duan, L., Xu, D. & Tsang, I. 2014, 'Learning with Augmented Features for Supervised and Semi-supervised Heterogeneous Domain Adaptation', IEEE Transactions On Pattern Analysis And Machine Intelligence, vol. 36, no. 6, pp. 1134-1148.
View/Download from: Publisher's site
Zhai, Y., Ong, Y.-.S. & Tsang, I.W. 2014, 'The Emerging "Big Dimensionality"', IEEE Computational Intelligence Magazine, vol. 9, no. 3, pp. 14-26.
View/Download from: Publisher's site
The world continues to generate quintillions of bytes of data daily, leading to the pressing need for new efforts in dealing with the grand challenges brought by Big Data. Today, there is a growing consensus among the computational intelligence communities that data volume presents an immediate challenge pertaining to the scalability issue. However, when addressing volume in Big Data analytics, researchers in the data analytics community have largely taken a one-sided study of volume, which is the "Big Instance Size" factor of the data. The flip side of volume, which is the dimensionality factor of Big Data, has on the other hand received much less attention. This article thus represents an attempt to fill in this gap and places special focus on this relatively under-explored topic of "Big Dimensionality", wherein the explosion of features (variables) brings about new challenges to computational intelligence. We begin with an analysis of the origins of Big Dimensionality. The evolution of feature dimensionality in the last two decades is then studied using popular data repositories considered in the data analytics and computational intelligence research communities. Subsequently, the state-of-the-art feature selection schemes reported in the field of computational intelligence are reviewed to reveal the inadequacies of existing approaches in keeping pace with the emerging phenomenon of Big Dimensionality. Last but not least, the "curse and blessing of Big Dimensionality" are delineated and deliberated.
Mao, Q., Tsang, I.W., Gao, S. & Wang, L. 2014, 'Generalized Multiple Kernel Learning With Data-Dependent Priors', IEEE Transactions on Neural Networks and Learning Systems.
View/Download from: Publisher's site
Multiple kernel learning (MKL) and classifier ensemble are two mainstream methods for solving learning problems in which some sets of features/views are more informative than others, or the features/views within a given set are inconsistent. In this paper, we first present a novel probabilistic interpretation of MKL such that maximum entropy discrimination with a noninformative prior over multiple views is equivalent to the formulation of MKL. Instead of using the noninformative prior, we introduce a novel data-dependent prior based on an ensemble of kernel predictors, which enhances the prediction performance of MKL by leveraging the merits of the classifier ensemble. With the proposed probabilistic framework of MKL, we propose a hierarchical Bayesian model to learn the proposed data-dependent prior and classification model simultaneously. The resultant problem is convex and other information (e.g., instances with either missing views or missing labels) can be seamlessly incorporated into the data-dependent priors. Furthermore, a variety of existing MKL models can be recovered under the proposed MKL framework and can be readily extended to incorporate these priors. Extensive experiments demonstrate the benefits of our proposed framework in supervised and semisupervised settings, as well as in tasks with partial correspondence among multiple views.
Chen, L., Xu, D., Tsang, I.W. & Li, X. 2014, 'Spectral embedded hashing for scalable image retrieval.', IEEE transactions on cybernetics, vol. 44, no. 7, pp. 1180-1190.
We propose a new graph based hashing method called spectral embedded hashing (SEH) for large-scale image retrieval. We first introduce a new regularizer into the objective function of the recent work spectral hashing to control the mismatch between the resultant Hamming embedding and the low-dimensional data representation, which is obtained by using a linear regression function. This linear regression function can be employed to effectively handle out-of-sample data, and the introduction of the new regularizer makes SEH better cope with data sampled from a nonlinear manifold. Considering that SEH cannot efficiently cope with high dimensional data, we further extend SEH to kernel SEH (KSEH) to improve the efficiency and effectiveness, in which a nonlinear regression function can also be employed to obtain the low dimensional data representation. We also develop a new method to efficiently solve the approximate solution of the eigenvalue decomposition problem in SEH and KSEH. Moreover, we show that some existing hashing methods are special cases of our KSEH. Our comprehensive experiments on the CIFAR, Tiny-580K, NUS-WIDE, and Caltech-256 datasets clearly demonstrate the effectiveness of our methods.
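The out-of-sample recipe (embed, regress, binarize) is compact enough to sketch. In the hedged Python illustration below a PCA projection stands in for the paper's spectral embedding, ridge regression maps features to the embedding for unseen queries, and codes come from sign binarization; every parameter choice here is an assumption.

```python
# Embed-regress-binarize hashing sketch (PCA stands in for the spectral
# embedding; SEH's regularized objective is not reproduced here).
import numpy as np

def train_hash(X, n_bits=16, lam=1e-3):
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:n_bits].T                  # low-dimensional embedding
    # Ridge regression from features to the embedding handles out-of-sample data.
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ Y)
    return mean, W

def hash_codes(X, mean, W):
    return ((X - mean) @ W > 0).astype(np.uint8)   # sign binarization
```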
Ren, Z., Gao, S., Chia, L.T. & Tsang, I.W.H. 2014, 'Region-based saliency detection and its application in object recognition', IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 5, pp. 769-779.
View/Download from: Publisher's site
The objective of this paper is twofold. First, we introduce an effective region-based solution for saliency detection. Then, we apply the achieved saliency map to better encode the image features for solving object recognition task. To find the perceptually and semantically meaningful salient regions, we extract superpixels based on an adaptive mean shift algorithm as the basic elements for saliency detection. The saliency of each superpixel is measured by using its spatial compactness, which is calculated according to the results of Gaussian mixture model (GMM) clustering. To propagate saliency between similar clusters, we adopt a modified PageRank algorithm to refine the saliency map. Our method not only improves saliency detection through large salient region detection and noise tolerance in messy background, but also generates saliency maps with a well-defined object shape. Experimental results demonstrate the effectiveness of our method. Since the objects usually correspond to salient regions, and these regions usually play more important roles for object recognition than background, we apply our achieved saliency map for object recognition by incorporating a saliency map into sparse coding-based spatial pyramid matching (ScSPM) image representation. To learn a more discriminative codebook and better encode the features corresponding to the patches of the objects, we propose a weighted sparse coding for feature coding. Moreover, we also propose a saliency weighted max pooling to further emphasize the importance of those salient regions in feature pooling module. Experimental results on several datasets illustrate that our weighted ScSPM framework greatly outperforms ScSPM framework, and achieves excellent performance for object recognition. © 2013 IEEE.
Gao, S., Chia, L.T., Tsang, I.W.H. & Ren, Z. 2014, 'Concurrent single-label image classification and annotation via efficient multi-layer group sparse coding', IEEE Transactions on Multimedia, vol. 16, no. 3, pp. 762-771.
View/Download from: Publisher's site
We present a multi-layer group sparse coding framework for concurrent single-label image classification and annotation. By leveraging the dependency between image class label and tags, we introduce a multi-layer group sparse structure of the reconstruction coefficients. Such structure fully encodes the mutual dependency between the class label, which describes image content as a whole, and tags, which describe the components of the image content. Therefore we propose a multi-layer group based tag propagation method, which combines the class label and subgroups of instances with similar tag distribution to annotate test images. To make our model more suitable for nonlinear separable features, we also extend our multi-layer group sparse coding in the Reproducing Kernel Hilbert Space (RKHS), which further improves performances of image classification and annotation. Moreover, we also integrate our multi-layer group sparse coding with kNN strategy, which greatly improves the computational efficiency. Experimental results on the LabelMe, UIUC-Sports and NUS-WIDE-Object databases show that our method outperforms the baseline methods, and achieves excellent performances in both image classification and annotation tasks. © 2014 IEEE.
Gao, S., Tsang, I.W.H. & Ma, Y. 2014, 'Learning category-specific dictionary and shared dictionary for fine-grained image categorization', IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 623-634.
View/Download from: Publisher's site
This paper targets fine-grained image categorization by learning a category-specific dictionary for each category and a shared dictionary for all the categories. Such category-specific dictionaries encode subtle visual differences among different categories, while the shared dictionary encodes common visual patterns among all the categories. To this end, we impose incoherence constraints among the different dictionaries in the objective of feature coding. In addition, to make the learnt dictionary stable, we also impose the constraint that each dictionary should be self-incoherent. Our proposed dictionary learning formulation not only applies to fine-grained classification, but also improves conventional basic-level object categorization and other tasks such as event recognition. Experimental results on five data sets show that our method can outperform the state-of-the-art fine-grained image categorization frameworks as well as sparse coding based dictionary learning frameworks. All these results demonstrate the effectiveness of our method.
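As a concrete illustration of the incoherence idea, the following sketch scores a set of dictionaries by their mutual coherence plus within-dictionary self-coherence; the specific penalty form is an assumption for illustration, not the paper's exact objective.

```python
import numpy as np

def incoherence_penalty(dicts):
    """Sum of squared Frobenius cross-correlations between dictionaries,
    plus self-incoherence terms (off-diagonal Gram entries within each)."""
    total = 0.0
    for i in range(len(dicts)):
        for j in range(len(dicts)):
            G = dicts[i].T @ dicts[j]
            if i == j:
                G = G - np.diag(np.diag(G))  # ignore unit self-correlations
            total += np.linalg.norm(G, 'fro') ** 2
    return total

rng = np.random.default_rng(0)
D1, D2 = rng.standard_normal((16, 8)), rng.standard_normal((16, 8))
D1 /= np.linalg.norm(D1, axis=0)  # unit-norm atoms
D2 /= np.linalg.norm(D2, axis=0)
print(incoherence_penalty([D1, D2]))
```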
Mao, Q. & Tsang, I. 2013, 'A Feature Selection Method for Multivariate Performance Measures', IEEE Transactions On Pattern Analysis And Machine Intelligence, vol. 35, no. 9, pp. 2051-2063.
View/Download from: UTS OPUS or Publisher's site
Feature selection with specific multivariate performance measures is the key to the success of many applications such as image retrieval and text classification. Existing feature selection methods are usually designed for classification error. In this paper, we propose a generalized sparse regularizer. Based on the proposed regularizer, we present a unified feature selection framework for general loss functions. In particular, we study a novel feature selection paradigm that optimizes multivariate performance measures. The resultant formulation is a challenging problem for high-dimensional data. Hence, a two-layer cutting plane algorithm is proposed to solve this problem, and its convergence is established. In addition, we adapt the proposed method to optimize multivariate measures for multiple-instance learning problems. Comparisons with state-of-the-art feature selection methods show that the proposed method is superior to them. Extensive experiments on large-scale and high-dimensional real-world datasets show that the proposed method outperforms l1-SVM and SVM-RFE when choosing a small subset of features, and achieves significantly improved performance over SVMperf in terms of F1-score.
Li, Y., Tsang, I., Kwok, J.T. & Zhou, Z. 2013, 'Convex and Scalable Weakly Labeled SVMs', Journal of Machine Learning Research, vol. 14, no. 1, pp. 2151-2188.
View/Download from: UTS OPUS
In this paper, we study the problem of learning from weakly labeled data, where the labels of the training examples are incomplete. This includes, for example, (i) semi-supervised learning, where labels are partially known; (ii) multi-instance learning, where labels are implicitly known; and (iii) clustering, where labels are completely unknown. Unlike supervised learning, learning with weak labels involves a difficult Mixed-Integer Programming (MIP) problem. Therefore, it can suffer from poor scalability and may also get stuck in local minima. In this paper, we focus on SVMs and propose the WELLSVM via a novel label generation strategy. This leads to a convex relaxation of the original MIP, which is at least as tight as existing convex Semi-Definite Programming (SDP) relaxations. Moreover, the WELLSVM can be solved via a sequence of SVM subproblems that are much more scalable than previous convex SDP relaxations. Experiments on three weakly labeled learning tasks, namely, (i) semi-supervised learning; (ii) multi-instance learning for locating regions of interest in content-based information retrieval; and (iii) clustering, clearly demonstrate improved performance, and WELLSVM is also readily applicable to large data sets.
Li, N., Tsang, I. & Zhou, Z. 2013, 'Efficient Optimization of Performance Measures by Classifier Adaptation', IEEE Transactions On Pattern Analysis And Machine Intelligence, vol. 35, no. 6, pp. 1370-1382.
View/Download from: UTS OPUS or Publisher's site
In practical applications, machine learning algorithms are often needed to learn classifiers that optimize domain-specific performance measures. Previous research has focused on learning the needed classifier in isolation, yet learning a nonlinear classifier for nonlinear and nonsmooth performance measures remains hard. In this paper, rather than learning the needed classifier by optimizing a specific performance measure directly, we circumvent this problem by proposing a novel two-step approach called CAPO: first train nonlinear auxiliary classifiers with existing learning methods, and then adapt the auxiliary classifiers for specific performance measures. In the first step, auxiliary classifiers can be obtained efficiently using off-the-shelf learning algorithms. For the second step, we show that the classifier adaptation problem can be reduced to a quadratic programming problem, which is similar to linear SVMperf and can be solved efficiently. By exploiting nonlinear auxiliary classifiers, CAPO can generate a nonlinear classifier that optimizes a large variety of performance measures, including all the performance measures based on the contingency table as well as AUC, while keeping high computational efficiency. Empirical studies show that CAPO is effective and computationally efficient, and it is even more efficient than linear SVMperf.
Chen, B., Lam, W., Tsang, I. & Wong, T. 2013, 'Discovering Low-Rank Shared Concept Space for Adapting Text Mining Models', IEEE Transactions On Pattern Analysis And Machine Intelligence, vol. 35, no. 6, pp. 1284-1297.
View/Download from: UTS OPUS or Publisher's site
We propose a framework for adapting text mining models that discovers a low-rank shared concept space. The major characteristic of this concept space is that it explicitly minimizes the distribution gap between the source domain, which has sufficient labeled data, and the target domain, which has only unlabeled data, while at the same time minimizing the empirical loss on the labeled data in the source domain. Our method is capable of conducting the domain adaptation task both in the original feature space and in the transformed Reproducing Kernel Hilbert Space (RKHS) using kernel tricks. Theoretical analysis guarantees that the error of our adaptation model can be bounded with respect to the embedded distribution gap and the empirical loss in the source domain. We have conducted extensive experiments on two common text mining problems, namely, document classification and information extraction, to demonstrate the efficacy of our proposed framework.
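A standard way to quantify such a distribution gap is the squared maximum mean discrepancy (MMD) between samples of the two domains. The following is a minimal sketch using an RBF kernel; the bandwidth `gamma` is an arbitrary illustrative choice.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix between rows of X and rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(Xs, Xt, gamma=1.0):
    """Squared empirical MMD between source and target samples."""
    return (rbf_kernel(Xs, Xs, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean()
            - 2 * rbf_kernel(Xs, Xt, gamma).mean())

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, (100, 5))   # source domain
Xt = rng.normal(0.5, 1.0, (120, 5))   # mean-shifted target domain
print(mmd2(Xs, Xt))
```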
Gao, S., Tsang, I. & Chia, L. 2013, 'Laplacian Sparse Coding, Hypergraph Laplacian Sparse Coding, and Applications', IEEE Transactions On Pattern Analysis And Machine Intelligence, vol. 35, no. 1, pp. 92-104.
View/Download from: UTS OPUS or Publisher's site
Sparse coding exhibits good performance in many computer vision applications. However, due to the overcomplete codebook and the independent coding process, the locality and the similarity among the instances to be encoded are lost. To preserve such locality and similarity information, we propose a Laplacian sparse coding (LSc) framework. By incorporating a similarity-preserving term into the objective of sparse coding, our proposed Laplacian sparse coding can alleviate the instability of sparse codes. Furthermore, we propose a Hypergraph Laplacian sparse coding (HLSc), which extends our Laplacian sparse coding to the case where the similarity among the instances is defined by a hypergraph. Specifically, HLSc captures the similarity among the instances within the same hyperedge and encourages their sparse codes to be similar to each other. Both Laplacian sparse coding and Hypergraph Laplacian sparse coding enhance the robustness of sparse coding. We apply Laplacian sparse coding to feature quantization in Bag-of-Words image representation, where it outperforms sparse coding and achieves good performance on the image classification problem. Hypergraph Laplacian sparse coding is also successfully used to solve the semi-automatic image tagging problem. The good performance of these applications demonstrates the effectiveness of our proposed formulations in locality and similarity preservation.
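The flavor of the LSc objective can be sketched as follows: standard sparse coding plus a Laplacian term tr(C L Cᵀ) that is small when similar instances receive similar codes. The weights `lam` and `beta` and the toy graph are illustrative assumptions.

```python
import numpy as np

def lsc_objective(X, D, C, L, lam=0.1, beta=0.1):
    """Laplacian sparse coding objective (sketch): reconstruction error,
    l1 sparsity, and a similarity-preserving Laplacian regularizer."""
    recon = np.linalg.norm(X - D @ C, 'fro') ** 2
    sparsity = lam * np.abs(C).sum()
    laplacian = beta * np.trace(C @ L @ C.T)  # small for smooth-on-graph codes
    return recon + sparsity + laplacian

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 5))        # 5 instances in 10 dimensions
D = rng.standard_normal((10, 8))        # overcomplete codebook of 8 atoms
C = 0.1 * rng.standard_normal((8, 5))   # codes, one column per instance
A = np.ones((5, 5)) - np.eye(5)         # toy fully connected similarity graph
L = np.diag(A.sum(axis=1)) - A          # combinatorial graph Laplacian
print(lsc_objective(X, D, C, L))
```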
Xu, X., Tsang, I. & Xu, D. 2013, 'Soft Margin Multiple Kernel Learning', IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 749-761.
View/Download from: UTS OPUS or Publisher's site
Multiple kernel learning (MKL) has been proposed for kernel methods by learning the optimal kernel from a set of predefined base kernels. However, the traditional L1MKL method often achieves worse results than the simplest method using the average of base kernels (i.e., the average kernel) in some practical applications. In order to improve the effectiveness of MKL, this paper presents a novel soft margin perspective for MKL. Specifically, we introduce an additional slack variable called the kernel slack variable to each quadratic constraint of MKL, which corresponds to one support vector machine model using a single base kernel. We first show that L1MKL can be deemed hard margin MKL, and then we propose a novel soft margin framework for MKL. Three commonly used loss functions, including the hinge loss, the square hinge loss, and the square loss, can be readily incorporated into this framework, leading to the new soft margin MKL objective functions. Many existing MKL methods can be shown to be special cases under our soft margin framework. For example, the hinge loss soft margin MKL leads to a new box constraint for kernel combination coefficients. Using different hyper-parameter values for this formulation, we can inherently bridge the method using the average kernel, L1MKL, and the hinge loss soft margin MKL. The square hinge loss soft margin MKL unifies the family of elastic net constraint/regularizer based approaches, and the square loss soft margin MKL incorporates L2MKL naturally. Moreover, we develop efficient algorithms for solving both the hinge loss and square hinge loss soft margin MKL. Comprehensive experimental studies for various MKL algorithms on several benchmark data sets and two real-world applications, video action recognition and event recognition, demonstrate that our proposed algorithms can efficiently achieve an effective yet sparse solution for MKL.
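The basic MKL ingredient, a weighted combination of base Gram matrices fed to an SVM, can be sketched as below. Real soft margin MKL learns the combination weights jointly with the classifier via the kernel slack variables; the fixed weights here are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def combine_kernels(kernels, weights):
    """Convex combination of base Gram matrices, as used in MKL."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Two base kernels: linear and RBF.
K_lin = X @ X.T
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
K_rbf = np.exp(-0.5 * d2)

K = combine_kernels([K_lin, K_rbf], [0.3, 0.7])  # illustrative fixed weights
clf = SVC(kernel='precomputed').fit(K, y)        # SVM on the combined Gram
print(clf.score(K, y))                           # training accuracy
```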
Mao, Q. & Tsang, I. 2013, 'Efficient Multitemplate Learning for Structured Prediction', IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 2, pp. 248-261.
View/Download from: UTS OPUS or Publisher's site
Conditional random fields (CRFs) and structural support vector machines (structural SVMs) are two state-of-the-art methods for structured prediction that capture the interdependencies among output variables. The success of these methods is attributed to the fact that their discriminative models are able to account for overlapping features on all input observations. These features are usually generated by applying a given set of templates on labeled data, but improper templates may lead to degraded performance. To alleviate this issue, we propose a novel multiple template learning paradigm that learns structured prediction and the importance of each template simultaneously, so that hundreds of arbitrary templates can be added into the learning model without caution. This paradigm can be formulated as a special multiple kernel learning problem with an exponential number of constraints. We then introduce an efficient cutting-plane algorithm to solve this problem in the primal and present its convergence. We also evaluate the proposed learning paradigm on two widely studied structured prediction tasks, i.e., sequence labeling and dependency parsing. Extensive experimental results show that the proposed method outperforms CRFs and structural SVMs by exploiting the importance of each template. Complexity analysis and empirical results also show that the proposed method is more efficient than online multikernel learning on very sparse and high-dimensional data. We further extend this paradigm for structured prediction using generalized p-block norm regularization with p > 1, and experiments show competitive performance when p ∈ [1, 2).
Mao, Q., Tsang, I. & Gao, S. 2013, 'Objective-guided Image Annotation', IEEE Transactions On Image Processing, vol. 22, no. 4, pp. 1585-1597.
View/Download from: UTS OPUS or Publisher's site
Automatic image annotation, which is usually formulated as a multi-label classification problem, is one of the major tools used to enhance the semantic understanding of web images. Many multimedia applications (e.g., tag-based image retrieval) can greatly benefit from image annotation. However, the insufficient performance of image annotation methods prevents these applications from being practical. On the other hand, specific measures are usually designed to evaluate how well one annotation method performs for a specific objective or application, but most image annotation methods do not consider optimization of these measures, so they are inevitably trapped in suboptimal performance with respect to these objective-specific measures. To address this issue, we first summarize a variety of objective-guided performance measures under a unified representation. Our analysis reveals that macro-averaging measures are very sensitive to infrequent keywords, and the Hamming measure is easily affected by skewed distributions. We then propose a unified multi-label learning framework, which directly optimizes a variety of objective-specific measures of multi-label learning tasks. Specifically, we first present a multilayer hierarchical structure of learning hypotheses for multi-label problems, based on which a variety of loss functions with respect to objective-guided measures are defined.
Gao, S., Tsang, I. & Chia, L.T. 2013, 'Sparse Representation with Kernels', IEEE Transactions On Image Processing, vol. 22, no. 2, pp. 423-434.
View/Download from: UTS OPUS or Publisher's site
Recent research has shown the initial success of sparse coding (Sc) in solving many computer vision tasks. Motivated by the fact that the kernel trick can capture the nonlinear similarity of features, which helps in finding a sparse representation of nonlinear features, we propose kernel sparse representation (KSR). Essentially, KSR is a sparse coding technique in a high-dimensional feature space mapped by an implicit mapping function. We apply KSR to feature coding in image classification, face recognition, and kernel matrix approximation. More specifically, by incorporating KSR into spatial pyramid matching (SPM), we develop KSRSPM, which achieves good performance for image classification. Moreover, KSR-based feature coding can be shown to be a generalization of the efficient match kernel and an extension of Sc-based SPM. We further show that our proposed KSR using a histogram intersection kernel (HIK) can be considered a soft assignment extension of HIK-based feature quantization in the feature coding process. Beyond feature coding, compared with sparse coding, KSR can learn more discriminative sparse codes and achieve higher accuracy for face recognition. KSR can also be applied to kernel matrix approximation in large-scale learning tasks, where it demonstrates robustness, especially when only a small fraction of the data is used. Extensive experimental results demonstrate promising results of KSR in image classification, face recognition, and kernel matrix approximation. All these applications prove the effectiveness of KSR in computer vision and machine learning tasks.
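The key trick in KSR is that the coding objective never needs the explicit feature map: the reconstruction error expands entirely in kernel evaluations. A minimal sketch of that expansion for one sample (the RBF kernel and the value of `lam` are illustrative assumptions):

```python
import numpy as np

def ksr_objective(k_xx, k_xD, K_DD, c, lam=0.1):
    """Kernel sparse representation objective for one sample (sketch):
    ||phi(x) - Phi(D) c||^2 + lam * ||c||_1, expanded via kernel values."""
    recon = k_xx - 2.0 * (k_xD @ c) + c @ K_DD @ c
    return recon + lam * np.abs(c).sum()

rng = np.random.default_rng(0)
D = rng.standard_normal((4, 6))                    # 6 atoms in input space
x = rng.standard_normal(4)
rbf = lambda a, b: np.exp(-0.5 * ((a - b) ** 2).sum())
k_xx = 1.0                                          # k(x, x) for an RBF kernel
k_xD = np.array([rbf(x, d) for d in D.T])           # kernel vs. each atom
K_DD = np.array([[rbf(di, dj) for dj in D.T] for di in D.T])
c = 0.1 * rng.standard_normal(6)
print(ksr_objective(k_xx, k_xD, K_DD, c))
```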
Seah, C.W., Ong, Y.S. & Tsang, I. 2013, 'Combating Negative Transfer from Predictive Distribution Differences', IEEE Transactions on Cybernetics, vol. 43, no. 4, pp. 1153-1165.
View/Download from: UTS OPUS or Publisher's site
Domain adaptation (DA), which leverages labeled data from related source domains, comes in handy when the label information of the target domain is scarce or unavailable. However, as the source data do not come from the same origin as the target domain, the predictive distributions of the source and target domains are likely to differ in reality. At the extreme, the predictive distributions of the source domains can differ completely from that of the target domain. In such a case, using the learned source classifier to assist in the prediction of target data can result in prediction performance that is poorer than that obtained by omitting the source data. This phenomenon is known as negative transfer, with an impact known to be more severe in the multiclass context. To combat negative transfer due to differing predictive distributions across domains, we first introduce the notion of positive transferability for the assessment of synergy between the source and target domains in their prediction models, and we propose a criterion to measure the positive transferability between sample pairs of different domains in terms of their prediction distributions. With the new measure, a predictive distribution matching (PDM) regularizer and a PDM framework learn the target classifier by favoring source data with large positive transferability while inferring the labels of target unlabeled data. Extensive experiments are conducted to validate the performance efficacy of the proposed PDM framework using several commonly used multidomain benchmark data sets, including Sentiment, Reuters, and Newsgroup, in the context of both binary-class and multiclass domains.
Seah, C.W., Tsang, I. & Ong, Y.S. 2013, 'Transfer Ordinal Label Learning', IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 11, pp. 1863-1876.
View/Download from: UTS OPUS or Publisher's site
Designing a classifier in the absence of labeled data is becoming a common encounter, as the acquisition of informative labels is often difficult or expensive, particularly on new, uncharted target domains. The feasibility of attaining a reliable classifier for the task of interest has been pursued in transfer learning, where label information from relevant source domains is considered for complementing the design process. The core challenge arising from such endeavors, however, is the induction of source sample selection bias, such that the trained classifier has the tendency of steering toward the distribution of the source domain. In addition, this bias is deemed to become more severe on data involving multiple classes. Considering this cue, our interest in this paper is to address such a challenge in the target domain, where ordinal labeled data are unavailable. In contrast to previous works, we propose a transfer ordinal label learning paradigm to predict the ordinal labels of target unlabeled data by spanning the feasible solution space with an ensemble of ordinal classifiers from the multiple relevant source domains. Specifically, the maximum margin criterion is considered here for the construction of the target classifier from an ensemble of source ordinal classifiers. Theoretical analysis and extensive empirical studies on real-world data sets are presented to study the benefits of the proposed method.
Tan, M., Tsang, I. & Wang, L. 2013, 'Minimax Sparse Logistic Regression for Very High Dimensional Feature Selection', IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1609-1622.
View/Download from: UTS OPUS or Publisher's site
Because of its strong convexity and probabilistic underpinnings, logistic regression (LR) is widely used in many real-world applications. However, in many problems, such as bioinformatics, choosing a small subset of features with the most discriminative power is desirable for interpreting the prediction model, making robust predictions, or performing deeper analysis. To achieve a sparse solution with respect to input features, many sparse LR models have been proposed. However, it is still challenging for them to efficiently obtain unbiased sparse solutions to very high-dimensional problems (e.g., identifying the most discriminative subset from millions of features). In this paper, we propose a new minimax sparse LR model for very high-dimensional feature selection, which can be efficiently solved by a cutting plane algorithm. To solve the resultant nonsmooth minimax subproblems, a smoothing coordinate descent method is presented. Numerical issues and the convergence rate of this method are carefully studied. Experimental results on several synthetic and real-world datasets show that the proposed method can obtain better prediction accuracy with the same number of selected features and has better or competitive scalability on very high-dimensional problems compared with the baseline methods, including l1-regularized LR.
Duan, L., Xu, D. & Tsang, I. 2012, 'Domain Adaptation from Multiple Sources: A Domain-Dependent Regularization Approach', IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 3, pp. 504-518.
View/Download from: UTS OPUS or Publisher's site
Duan, L., Tsang, I. & Xu, D. 2012, 'Domain Transfer Multiple Kernel Learning', IEEE Transactions On Pattern Analysis And Machine Intelligence, vol. 34, no. 3, pp. 465-479.
View/Download from: UTS OPUS or Publisher's site
Cross-domain learning methods have shown promising results by leveraging labeled patterns from the auxiliary domain to learn a robust classifier for the target domain which has only a limited number of labeled samples. To cope with the considerable change between feature distributions of different domains, we propose a new cross-domain kernel learning framework into which many existing kernel methods can be readily incorporated. Our framework, referred to as Domain Transfer Multiple Kernel Learning (DTMKL), simultaneously learns a kernel function and a robust classifier by minimizing both the structural risk functional and the distribution mismatch between the labeled and unlabeled samples from the auxiliary and target domains. Under the DTMKL framework, we also propose two novel methods by using SVM and prelearned classifiers, respectively. Comprehensive experiments on three domain adaptation data sets (i.e., TRECVID, 20 Newsgroups, and email spam data sets) demonstrate that DTMKL-based methods outperform existing cross-domain learning and multiple kernel learning methods.
Seah, C.W., Tsang, I. & Ong, Y.S. 2012, 'Transductive Ordinal Regression', IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1074-1086.
View/Download from: UTS OPUS or Publisher's site
Ordinal regression is commonly formulated as a multiclass problem with ordinal constraints. The challenge of designing accurate classifiers for ordinal regression generally increases with the number of classes involved, due to the large number of labeled patterns that are needed. The availability of ordinal class labels, however, is often costly to calibrate or difficult to obtain. Unlabeled patterns, on the other hand, often exist in much greater abundance and are freely available. To benefit from the abundance of unlabeled patterns, we present a novel transductive learning paradigm for ordinal regression in this paper, namely transductive ordinal regression (TOR). The key challenge of this paper lies in the precise estimation of both the ordinal class labels of the unlabeled data and the decision functions of the ordinal classes, simultaneously. The core elements of the proposed TOR include an objective function that caters to several commonly used loss functions cast in transductive settings, for general ordinal regression. A label swapping scheme that facilitates a strictly monotonic decrease in the objective function value is also introduced. Extensive numerical studies on commonly used benchmark datasets, including the real-world sentiment prediction problem, are then presented to showcase the characteristics and efficacies of the proposed TOR. Further, comparisons with recent state-of-the-art ordinal regression methods demonstrate that the introduced transductive learning paradigm for ordinal regression leads to robust and improved performance.
Chen, L., Tsang, I. & Xu, D. 2012, 'Laplacian Embedded Regression for Scalable Manifold Regularization', IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 6, pp. 902-915.
View/Download from: UTS OPUS or Publisher's site
Semi-supervised learning (SSL), as a powerful tool to learn from a limited number of labeled data and a large number of unlabeled data, has been attracting increasing attention in the machine learning community. In particular, the manifold regularization framework has laid solid theoretical foundations for a large family of SSL algorithms, such as Laplacian support vector machine (LapSVM) and Laplacian regularized least squares (LapRLS). However, most of these algorithms are limited to small-scale problems due to the high computational cost of the matrix inversion operation involved in the optimization problem. In this paper, we propose a novel framework called Laplacian embedded regression by introducing an intermediate decision variable into the manifold regularization framework. By using the ε-insensitive loss, we obtain the Laplacian embedded support vector regression (LapESVR) algorithm, which inherits the sparse solution from SVR. We also derive Laplacian embedded RLS (LapERLS) corresponding to RLS under the proposed framework. Both LapESVR and LapERLS possess a simpler form of a transformed kernel, which is the summation of the original kernel and a graph kernel that captures the manifold structure. The benefits of the transformed kernel are twofold: (1) we can deal with the original kernel matrix and the graph Laplacian matrix in the graph kernel separately, and (2) if the graph Laplacian matrix is sparse, we only need to perform the inverse operation for a sparse matrix, which is much more efficient than for a dense one.
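The "original kernel plus graph kernel" combination can be sketched as below. Here the graph kernel is taken to be a regularized inverse of the Laplacian, an illustrative choice of a smooth-on-graph kernel, not the exact form derived in the paper; note that when `L` is sparse, only a sparse solve is required, matching the point made above.

```python
import numpy as np

def transformed_kernel(K, L, mu=1.0, eps=1e-6):
    """Sketch of a transformed kernel: original Gram matrix plus a graph
    kernel (regularized inverse Laplacian) capturing manifold structure."""
    n = L.shape[0]
    K_graph = np.linalg.inv(L + eps * np.eye(n))  # smooth-on-graph kernel
    return K + mu * K_graph

# Toy usage: chain graph over 4 points with a linear kernel.
X = np.arange(4, dtype=float).reshape(-1, 1)
K = X @ X.T
A = np.diag(np.ones(3), 1); A = A + A.T           # chain adjacency
L = np.diag(A.sum(axis=1)) - A                    # graph Laplacian
print(transformed_kernel(K, L))
```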
Chen, L., Xu, D., Tsang, I. & Luo, J. 2012, 'Tag-based Image Retrieval Improved by Augmented Features and Group based Refinement', IEEE Transactions On Multimedia, vol. 14, no. 4, pp. 1057-1067.
View/Download from: UTS OPUS or Publisher's site
In this paper, we propose a new tag-based image retrieval framework to improve the retrieval performance of a group of related personal images captured by the same user within a short period of an event by leveraging millions of training web images and their associated rich textual descriptions. For any given query tag (e.g., car), the inverted file method is employed to automatically determine the relevant training web images that are associated with the query tag and the irrelevant training web images that are not associated with the query tag. Using these relevant and irrelevant web images as positive and negative training data respectively, we propose a new classification method called support vector machine (SVM) with augmented features (AFSVM) to learn an adapted classifier by leveraging the prelearned SVM classifiers of popular tags that are associated with a large number of relevant training web images. Treating the decision values of one group of test photos from AFSVM classifiers as the initial relevance scores, in the subsequent group-based refinement process, we propose to use the Laplacian regularized least squares method to further refine the relevance scores of test photos by utilizing the visual similarity of the images within the group. Based on the refined relevance scores, our proposed framework can be readily applied to tag-based image retrieval for a group of raw consumer photos without any textual descriptions or a group of Flickr photos with noisy tags. Moreover, we propose a new method to better calculate the relevance scores for Flickr photos. Extensive experiments on two datasets demonstrate the effectiveness of our framework.
Duan, L., Xu, D., Tsang, I. & Luo, J. 2012, 'Visual Event Recognition in Videos by Learning from Web Data', IEEE Transactions On Pattern Analysis And Machine Intelligence, vol. 34, no. 9, pp. 1667-1680.
View/Download from: UTS OPUS or Publisher's site
We propose a visual event recognition framework for consumer videos by leveraging a large amount of loosely labeled web videos (e.g., from YouTube). Observing that consumer videos generally contain large intraclass variations within the same type of events, we first propose a new method, called Aligned Space-Time Pyramid Matching (ASTPM), to measure the distance between any two video clips. Second, we propose a new transfer learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), in order to 1) fuse the information from multiple pyramid levels and features (i.e., space-time features and static SIFT features) and 2) cope with the considerable variation in feature distributions between videos from two domains (i.e., web video domain and consumer video domain). For each pyramid level and each type of local features, we first train a set of SVM classifiers based on the combined training set from two domains by using multiple base kernels from different kernel types and parameters, which are then fused with equal weights to obtain a prelearned average classifier. In A-MKL, for each event class we learn an adapted target classifier based on multiple base kernels and the prelearned average classifiers from this event class or all the event classes by minimizing both the structural risk functional and the mismatch between data distributions of two domains. Extensive experiments demonstrate the effectiveness of our proposed framework that requires only a small number of labeled consumer videos by leveraging web data.
Li, S., Tsang, I.W. & Chaudhari, N.S. 2012, 'Relevance vector machine based infinite decision agent ensemble learning for credit risk analysis', Expert Systems with Applications, vol. 39, no. 5, pp. 4947-4953.
View/Download from: Publisher's site
In this paper, a relevance vector machine based infinite decision agent ensemble learning (RVM Ideal) system is proposed for robust credit risk analysis. In the first level of our model, we adopt soft margin boosting to overcome overfitting. In the second level, the RVM algorithm is revised for boosting so that different RVM agents can be generated from the updated instance space of the data. In the third level, the perceptron kernel is employed in RVM to simulate infinite subagents. Our system RVM Ideal also enjoys some good properties, such as good generalization performance, immunity to overfitting and the ability to predict the distance to default. According to the experimental results, our proposed system can achieve better performance in terms of sensitivity, specificity and overall accuracy.
Li, S., Tan, M., Tsang, I. & Kwok, J. 2011, 'A Hybrid PSO-BFGS strategy for global optimization of multimodal functions', IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 4, pp. 1003-1014.
View/Download from: UTS OPUS or Publisher's site
Particle swarm optimizer (PSO) is a powerful optimization algorithm that has been applied to a variety of problems. It can, however, suffer from premature convergence and slow convergence rate. Motivated by these two problems, a hybrid global optimization strategy combining PSOs with a modified Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is presented in this paper. The modified BFGS method is integrated into the context of the PSOs to improve the particles' local search ability. In addition, in conjunction with the territory technique, a reposition technique to maintain the diversity of particles is proposed to improve the global search ability of PSOs. One advantage of the hybrid strategy is that it can effectively find multiple local solutions or global solutions to the multimodal functions in a box-constrained space. Based on these local solutions, a reconstruction technique can be adopted to further estimate better solutions. The proposed method is compared with several recently developed optimization algorithms on a set of 20 standard benchmark problems. Experimental results demonstrate that the proposed approach can obtain high-quality solutions on multimodal function optimization problems.
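For a flavor of how such a hybrid works, here is a minimal PSO loop that polishes the incumbent solution with quasi-Newton BFGS at each iteration. The inertia and acceleration constants and the plain scipy BFGS call are illustrative stand-ins; the paper's modified BFGS, territory, and reposition techniques are not reproduced here.

```python
import numpy as np
from scipy.optimize import minimize

def pso_bfgs(f, lo, hi, n_particles=20, n_iter=50, seed=0):
    """Minimal hybrid PSO: standard velocity update, plus a BFGS polish
    of the best-known solution each iteration (may step outside bounds)."""
    rng = np.random.default_rng(seed)
    dim = lo.size
    x = rng.uniform(lo, hi, (n_particles, dim))
    v = np.zeros_like(x)
    pbest, pbest_f = x.copy(), np.array([f(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        fx = np.array([f(p) for p in x])
        better = fx < pbest_f
        pbest[better], pbest_f[better] = x[better], fx[better]
        # Local refinement of the incumbent with quasi-Newton BFGS.
        res = minimize(f, pbest[pbest_f.argmin()], method='BFGS')
        g = res.x if res.fun < pbest_f.min() else pbest[pbest_f.argmin()]
    return g, f(g)

# Multimodal test function (Rastrigin) on a 2-D box.
rastrigin = lambda z: 10 * z.size + np.sum(z**2 - 10 * np.cos(2 * np.pi * z))
x_best, f_best = pso_bfgs(rastrigin, np.full(2, -5.12), np.full(2, 5.12))
print(x_best, f_best)
```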
Liu, Y., Xu, D., Tsang, I. & Luo, J. 2011, 'Textual Query of Personal Photos Facilitated by Large-scale Web Data', IEEE Transactions On Pattern Analysis And Machine Intelligence, vol. 33, no. 5, pp. 1022-1036.
View/Download from: UTS OPUS or Publisher's site
The rapid popularization of digital cameras and mobile phone cameras has led to an explosive growth of personal photo collections by consumers. In this paper, we present a real-time textual query-based personal photo retrieval system by leveraging millions of Web images and their associated rich textual descriptions (captions, categories, etc.). After a user provides a textual query (e.g., "water"), our system exploits the inverted file to automatically find the positive Web images that are related to the textual query "water" as well as the negative Web images that are irrelevant to the textual query. Based on these automatically retrieved relevant and irrelevant Web images, we employ three simple but effective classification methods, k-Nearest Neighbor (kNN), decision stumps, and linear SVM, to rank personal photos. To further improve the photo retrieval performance, we propose two relevance feedback methods via cross-domain learning, which effectively utilize both the Web images and personal images. In particular, our proposed cross-domain learning methods can learn robust classifiers with only a very limited amount of labeled personal photos from the user by leveraging the prelearned linear SVM classifiers in real time. We further propose an incremental cross-domain learning method in order to significantly accelerate the relevance feedback process on large consumer photo databases. Extensive experiments on two consumer photo data sets demonstrate the effectiveness and efficiency of our system, which is also inherently not limited by any predefined lexicon.
Zhuang, J., Tsang, I. & Hoi, S.C. 2011, 'A Family of Simple Non-Parametric Kernel Learning Algorithms', Journal of Machine Learning Research, vol. 12, no. 4, pp. 1313-1347.
View/Download from: UTS OPUS
Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by general-purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interior-point SDP solver could be as high as O(N^6.5), which prevents NPKL methods from being applied to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed "SimpleNPKL", which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is the SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigendecomposition technique. The other is the SimpleNPKL algorithm with other loss functions (including the square hinge loss, hinge loss, and square loss), which can be reformulated as a saddle-point optimization problem and further solved by a fast iterative algorithm. In contrast to previous NPKL approaches, our empirical results show that the proposed new technique, while maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed technique can speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding.
Pan, S.J., Tsang, I., Kwok, J. & Yang, Q. 2011, 'Domain Adaptation via Transfer Component Analysis', IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199-210.
View/Download from: UTS OPUS or Publisher's site
Domain adaptation allows knowledge from a source domain to be transferred to a different but related target domain. Intuitively, discovering a good feature representation across domains is crucial. In this paper, we propose to find such a representation through a new learning method, transfer component analysis (TCA), for domain adaptation. TCA tries to learn some transfer components across domains in a reproducing kernel Hilbert space using the maximum mean discrepancy (MMD). In the subspace spanned by these transfer components, data properties are preserved and data distributions in different domains are close to each other. As a result, with the new representations in this subspace, we can apply standard machine learning methods to train classifiers or regression models in the source domain for use in the target domain. Furthermore, in order to uncover the knowledge hidden in the relations between the data labels from the source and target domains, we extend TCA in a semisupervised learning setting, which encodes label information into transfer components learning. We call this extension semisupervised TCA. The main contribution of our work is that we propose a novel dimensionality reduction framework for reducing the distance between domains in a latent space for domain adaptation. We propose both unsupervised and semisupervised feature extraction approaches, which can dramatically reduce the distance between domain distributions by projecting data onto the learned transfer components. Finally, our approach can handle large datasets and naturally leads to out-of-sample generalization. The effectiveness and efficiency of our approach are verified by experiments on five toy datasets and two real-world applications: cross-domain indoor WiFi localization and cross-domain text classification.
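Unsupervised TCA admits a compact closed form: take the leading eigenvectors of (KLK + μI)⁻¹KHK, where L encodes the MMD between source and target samples and H is the centering matrix. A minimal dense numpy sketch of that recipe (illustrative, not an optimized implementation):

```python
import numpy as np

def tca_components(K, ns, nt, mu=1.0, dim=2):
    """Unsupervised TCA sketch.

    K : (n, n) kernel over the ns source followed by nt target points.
    Returns the (n, dim) embedded representation K @ W.
    """
    n = ns + nt
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    L = np.outer(e, e)                       # MMD coefficient matrix
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    A = np.linalg.inv(K @ L @ K + mu * np.eye(n)) @ (K @ H @ K)
    vals, vecs = np.linalg.eig(A)
    W = vecs[:, np.argsort(-vals.real)[:dim]].real
    return K @ W

rng = np.random.default_rng(0)
Xs, Xt = rng.normal(0, 1, (30, 4)), rng.normal(0.5, 1, (30, 4))
X = np.vstack([Xs, Xt])
Z = tca_components(X @ X.T, ns=30, nt=30)    # linear kernel for simplicity
print(Z.shape)
```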
Duan, L., Li, W., Tsang, I. & Xu, D. 2011, 'Improving Web Image Search by Bag-based Re-ranking', IEEE Transactions On Image Processing, vol. 20, no. 11, pp. 3280-3290.
View/Download from: UTS OPUS or Publisher's site
Given a textual query in traditional text-based image retrieval (TBIR), relevant images are to be reranked using visual features after the initial text-based search. In this paper, we propose a new bag-based reranking framework for large-scale TBIR. Specifically, we first cluster relevant images using both textual and visual features. By treating each cluster as a bag and the images in the bag as instances, we formulate this problem as a multi-instance (MI) learning problem. MI learning methods such as mi-SVM can be readily incorporated into our bag-based reranking framework. Observing that at least a certain portion of a positive bag is of positive instances while a negative bag might also contain positive instances, we further use a more suitable generalized MI (GMI) setting for this application. To address the ambiguities on the instance labels in the positive and negative bags under this GMI setting, we develop a new method referred to as GMI-SVM to enhance retrieval performance by propagating the labels from the bag level to the instance level. To acquire bag annotations for (G)MI learning, we propose a bag ranking method to rank all the bags according to the defined bag ranking score. The top ranked bags are used as pseudopositive training bags, while pseudonegative training bags can be obtained by randomly sampling a few irrelevant images that are not associated with the textual query. Comprehensive experiments on the challenging real-world data set NUS-WIDE demonstrate our framework with automatic bag annotation can achieve the best performances compared with existing image reranking methods. Our experiments also demonstrate that GMI-SVM can achieve better performances when using the manually labeled training bags obtained from relevance feedback.
Nie, F., Zeng, Z., Xu, D., Tsang, I. & Zhang, C. 2011, 'Spectral Embedded Clustering: A Framework for In-Sample and Out-of-Sample Spectral Clustering', IEEE Transactions on Neural Networks, vol. 22, no. 11, pp. 1796-1808.
View/Download from: UTS OPUS or Publisher's site
Spectral clustering (SC) methods have been successfully applied to many real-world applications. The success of these SC methods is largely based on the manifold assumption, namely, that two nearby data points in the high-density region of a low-dimensional data manifold have the same cluster label. However, such an assumption might not always hold on high-dimensional data. When the data do not exhibit a clear low-dimensional manifold structure (e.g., high-dimensional and sparse data), the clustering performance of SC will be degraded and become even worse than K-means clustering. In this paper, motivated by the observation that the true cluster assignment matrix for high-dimensional data can be always embedded in a linear space spanned by the data, we propose the spectral embedded clustering (SEC) framework, in which a linearity regularization is explicitly added into the objective function of SC methods. More importantly, the proposed SEC framework can naturally deal with out-of-sample data. We also present a new Laplacian matrix constructed from a local regression of each pattern and incorporate it into our SEC framework to capture both local and global discriminative information for clustering. Comprehensive experiments on eight real-world high-dimensional datasets demonstrate the effectiveness and advantages of our SEC framework over existing SC methods and K-means-based clustering methods. Our SEC framework significantly outperforms SC using the Nyström algorithm on unseen data.
Tjhi, W.C., Lee, K.K., Hung, T., Tsang, I.W., Ong, Y.S., Bard, F. & Racine, V. 2011, 'Exploratory analysis of cell-based screening data for phenotype identification in drug-siRNA study.', International journal of computational biology and drug design, vol. 4, no. 2, pp. 194-215.
Most phenotype-identification methods in cell-based screening assume prior knowledge about expected phenotypes or involve intricate parameter-setting. They are useful for analyses targeting known phenotype properties, but a need exists to explore, with minimal presumptions, the potentially interesting phenotypes derivable from data. We present a method for this exploration, using clustering to eliminate the phenotype-labelling requirement and GUI visualisation to facilitate parameter-setting. The steps are: outlier removal, cell clustering and interactive visualisation for phenotype refinement. For drug-siRNA studies, we introduce an auto-merging procedure to reduce phenotype redundancy. We validated the method on two Golgi apparatus screens and showcase its contribution to a better understanding of screening images.
Zhong, W., Pan, W., Kwok, J. & Tsang, I. 2010, 'Incorporating the Loss Function into Discriminative Clustering of Structured Outputs', IEEE Transactions on Neural Networks, vol. 21, no. 10, pp. 1564-1575.
View/Download from: UTS OPUS or Publisher's site
Clustering using the Hilbert Schmidt independence criterion (CLUHSIC) is a recent clustering algorithm that maximizes the dependence between cluster labels and data observations according to the Hilbert Schmidt independence criterion (HSIC). It is unique in that structure information on the cluster outputs can be easily utilized in the clustering process. However, while the choice of the loss function is known to be very important in supervised learning with structured outputs, we will show in this paper that CLUHSIC is implicitly using the often inappropriate zero-one loss. We propose an extension called CLUHSICAL (which stands for Clustering using HSIC and loss) which explicitly considers both the output dependency and loss function. Its optimization problem has the same form as CLUHSIC, except that its partition matrix is constructed in a different manner. Experimental results on a number of datasets with structured outputs show that CLUHSICAL often outperforms CLUHSIC in terms of both structured loss and clustering accuracy.
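For reference, the biased empirical HSIC that CLUHSIC-style objectives maximize can be computed in a few lines; the delta kernel on cluster labels below is one common choice for the output side, used here purely for illustration.

```python
import numpy as np

def hsic(K, L):
    """Biased empirical HSIC between a data kernel K and an output/label
    kernel L: tr(K H L H) / (n - 1)^2, with H the centering matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.integers(0, 2, 20)                        # toy cluster labels
K = X @ X.T                                       # linear kernel on data
Ly = (y[:, None] == y[None, :]).astype(float)     # delta kernel on labels
print(hsic(K, Ly))
```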
Nie, F., Xu, D., Tsang, I. & Zhang, C. 2010, 'Flexible Manifold Embedding: A Framework for Semi-supervised and Unsupervised Dimension Reduction', IEEE Transactions On Image Processing, vol. 19, no. 7, pp. 1921-1932.
View/Download from: UTS OPUS or Publisher's site
Mak, B., Lai, T.C., Tsang, I. & Kwok, J. 2009, 'Maximum Penalized Likelihood Kernel Regression for Fast Adaptation', IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 7, pp. 1372-1381.
View/Download from: UTS OPUS or Publisher's site
This paper proposes a nonlinear generalization of the popular maximum-likelihood linear regression (MLLR) adaptation algorithm using kernel methods. The proposed method, called maximum penalized likelihood kernel regression adaptation (MPLKR), applies kernel regression with appropriate regularization to determine the affine model transform in a kernel-induced high-dimensional feature space. Although this is not the first attempt at applying kernel methods to conventional linear adaptation algorithms, unlike most other kernelized adaptation methods such as kernel eigenvoice or kernel eigen-MLLR, MPLKR has the advantage that it is a convex optimization and its solution is always guaranteed to be globally optimal. In fact, the adapted Gaussian means can be obtained analytically by simply solving a system of linear equations. From the Bayesian perspective, MPLKR can also be considered the kernel version of maximum a posteriori linear regression (MAPLR) adaptation. Supervised and unsupervised speaker adaptation using MPLKR were evaluated on the Resource Management and Wall Street Journal 5K tasks, respectively, achieving word error rate reductions of 23.6% and 15.5%, respectively, over the speaker-independent model.
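The computational core, regularized kernel regression reducing to a single linear system, looks like this in a generic setting. This is plain penalized kernel regression as a sketch; MPLKR's actual likelihood-based transform estimation differs in detail.

```python
import numpy as np

def fit_penalized_kernel_regression(K, y, lam=1e-2):
    """Solve (K + lam*I) a = y once; no iterative optimization needed."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

# Toy usage: smooth a noisy 1-D signal with an RBF kernel.
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).standard_normal(50)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.02)
alpha = fit_penalized_kernel_regression(K, y)
y_hat = K @ alpha                                 # fitted values
```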
Zhang, K., Tsang, I. & Kwok, J. 2009, 'Maximum Margin Clustering Made Practical', IEEE Transactions on Neural Networks, vol. 20, no. 4, pp. 583-596.
View/Download from: UTS OPUS or Publisher's site
Motivated by the success of large margin methods in supervised learning, maximum margin clustering (MMC) is a recent approach that aims at extending large margin methods to unsupervised learning. However, its optimization problem is nonconvex and existing MMC methods all rely on reformulating and relaxing the nonconvex optimization problem as semidefinite programs (SDP). Though SDP is convex and standard solvers are available, they are computationally very expensive and only small data sets can be handled. To make MMC more practical, we avoid SDP relaxations and propose in this paper an efficient approach that performs alternating optimization directly on the original nonconvex problem. A key step to avoid premature convergence in the resultant iterative procedure is to change the loss function from the hinge loss to the Laplacian/square loss so that overconfident predictions are penalized. Experiments on a number of synthetic and real-world data sets demonstrate that the proposed approach is more accurate, much faster (hundreds to tens of thousands of times faster), and can handle data sets that are hundreds of times larger than the largest data set reported in the MMC literature.
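The alternating strategy can be illustrated with a deliberately simplified two-cluster sketch: fit a regularized square-loss classifier to the current pseudo-labels, then reassign labels under a balance constraint. The median threshold below is a crude stand-in for the paper's class-balance handling, and the linear model replaces its kernel machinery.

```python
import numpy as np

def mmc_alternating(X, n_iter=20, lam=1.0, seed=0):
    """Alternating optimization for two-cluster maximum margin clustering
    (sketch): ridge-regression step, then balanced label reassignment."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    y = np.sign(rng.standard_normal(n))           # random initial labels
    for _ in range(n_iter):
        # Fit w to the current pseudo-labels (square loss + ridge penalty).
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        f = X @ w
        # Threshold at the median to keep clusters balanced and avoid the
        # trivial one-cluster solution.
        y_new = np.where(f > np.median(f), 1.0, -1.0)
        if np.array_equal(y_new, y):
            break
        y = y_new
    return y

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
print(mmc_alternating(X))
```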
Tsang, I., Kocsor, A. & Kwok, J. 2008, 'Large-Scale Maximum Margin Discriminant Analysis Using Core Vector Machines', IEEE Transactions on Neural Networks, vol. 19, no. 4, pp. 610-624.
View/Download from: UTS OPUS or Publisher's site
Kwok, J.T., Tsang, I.W. & Zurada, J.M. 2007, 'A class of single-class minimax probability machines for novelty detection.', IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, vol. 18, no. 3, pp. 778-785.
Single-class minimax probability machines (MPMs) offer robust novelty detection with distribution-free worst case bounds on the probability that a pattern will fall inside the normal region. However, in practice, they are too cautious in labeling patterns as outlying and so have a high false negative rate (FNR). In this paper, we propose a more aggressive version of the single-class MPM that bounds the best case probability that a pattern will fall inside the normal region. These two MPMs can then be used together to delimit the solution space. By using the hyperplane lying in the middle of this pair of MPMs, a better compromise between false positives (FPs) and false negatives (FNs), and between recall and precision can be obtained. Experiments on the real-world data sets show encouraging results.
Park, J., Kang, D., Kim, J., Kwok, J.T. & Tsang, I.W. 2007, 'SVDD-based pattern denoising.', Neural computation, vol. 19, no. 7, pp. 1919-1938.
The support vector data description (SVDD) is one of the best-known one-class support vector learning methods, in which one tries the strategy of using balls defined on the feature space in order to distinguish a set of normal data from all other possible abnormal objects. The major concern of this letter is to extend the main idea of SVDD to pattern denoising. Combining the geodesic projection to the spherical decision boundary resulting from the SVDD, together with solving the preimage problem, we propose a new method for pattern denoising. We first solve SVDD for the training data and then for each noisy test pattern, obtain its denoised feature by moving its feature vector along the geodesic on the manifold to the nearest decision boundary of the SVDD ball. Finally we find the location of the denoised pattern by obtaining the pre-image of the denoised feature. The applicability of the proposed method is illustrated by a number of toy and real-world data sets.
Tsang, I.W., Kwok, J.T. & Zurada, J.M. 2006, 'Generalized core vector machines.', IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, vol. 17, no. 5, pp. 1126-1140.
Kernel methods, such as the support vector machine (SVM), are often formulated as quadratic programming (QP) problems. However, given m training patterns, a naive implementation of the QP solver takes O(m^3) training time and at least O(m^2) space. Hence, scaling up these QPs is a major stumbling block in applying kernel methods on very large data sets, and a replacement of the naive method for finding the QP solutions is highly desirable. Recently, by using approximation algorithms for the minimum enclosing ball (MEB) problem, we proposed the core vector machine (CVM) algorithm that is much faster and can handle much larger data sets than existing SVM implementations. However, the CVM can only be used with certain kernel functions and kernel methods. For example, the very popular support vector regression (SVR) cannot be used with the CVM. In this paper, we introduce the center-constrained MEB problem and subsequently extend the CVM algorithm. The generalized CVM algorithm can now be used with any linear/nonlinear kernel and can also be applied to kernel methods such as SVR and the ranking SVM. Moreover, like the original CVM, its asymptotic time complexity is again linear in m and its space complexity is independent of m. Experiments show that the generalized CVM has comparable performance with state-of-the-art SVM and SVR implementations, but is faster and produces fewer support vectors on very large data sets.
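The MEB approximation underlying CVM-style training can be sketched with the classic core-set iteration, shown here in input space for clarity; CVM actually runs the equivalent computation in the kernel-induced feature space over a weighted core set.

```python
import numpy as np

def approx_meb(X, eps=0.1):
    """(1 + eps)-approximate minimum enclosing ball via core-set updates:
    repeatedly nudge the center toward the current farthest point.
    O(1/eps^2) iterations suffice, independent of the number of points."""
    c = X[0].astype(float).copy()
    for t in range(1, int(np.ceil(1.0 / eps ** 2)) + 1):
        far = X[np.argmax(((X - c) ** 2).sum(axis=1))]
        c += (far - c) / (t + 1)
    r = np.sqrt(((X - c) ** 2).sum(axis=1).max())
    return c, r

rng = np.random.default_rng(0)
center, radius = approx_meb(rng.standard_normal((1000, 5)))
print(center, radius)
```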
Tsang, I.W. & Kwok, J.T. 2006, 'Efficient hyperkernel learning using second-order cone programming.', IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, vol. 17, no. 1, pp. 48-58.
The kernel function plays a central role in kernel methods. Most existing methods can only adapt the kernel parameters or the kernel matrix based on empirical data. Recently, Ong et al. introduced the method of hyperkernels, which can be used to learn the kernel function directly in an inductive setting. However, the associated optimization problem is a semidefinite program (SDP), which is very computationally expensive, even with the recent advances in interior point methods. In this paper, we show that this learning problem can be equivalently reformulated as a second-order cone program (SOCP), which can then be solved more efficiently than SDPs. Comparison is also made with the kernel matrix learning method proposed by Lanckriet et al. Experimental results on both classification and regression problems, with toy and real-world data sets, show that our proposed SOCP formulation has significant speedup over the original SDP formulation. Moreover, it yields better generalization than Lanckriet et al.'s method, with a speed that is comparable, or sometimes even faster, than their quadratically constrained quadratic program (QCQP) formulation.
Tsang, I., Kwok, J. & Cheung, P. 2005, 'Core vector machines: Fast SVM training on very large data sets', Journal of Machine Learning Research, vol. 6, pp. 363-392.
Li, S., Kwok, J.T., Tsang, I.W. & Wang, Y. 2004, 'Fusing images with different focuses using support vector machines.', IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, vol. 15, no. 6, pp. 1555-1561.
Many vision-related processing tasks, such as edge detection, image segmentation and stereo matching, can be performed more easily when all objects in the scene are in good focus. However, in practice, this may not always be feasible as optical lenses, especially those with long focal lengths, only have a limited depth of field. One common approach to recover an everywhere-in-focus image is to use wavelet-based image fusion. First, several source images with different focuses of the same scene are taken and processed with the discrete wavelet transform (DWT). Among these wavelet decompositions, the wavelet coefficient with the largest magnitude is selected at each pixel location. Finally, the fused image can be recovered by performing the inverse DWT. In this paper, we improve this fusion procedure by applying the discrete wavelet frame transform (DWFT) and the support vector machines (SVM). Unlike DWT, DWFT yields a translation-invariant signal representation. Using features extracted from the DWFT coefficients, a SVM is trained to select the source image that has the best focus at each pixel location, and the corresponding DWFT coefficients are then incorporated into the composite wavelet representation. Experimental results show that the proposed method outperforms the traditional approach both visually and quantitatively.
Kwok, J.T. & Tsang, I.W. 2004, 'The pre-image problem in kernel methods.', IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, vol. 15, no. 6, pp. 1517-1525.
In this paper, we address the problem of finding the pre-image of a feature vector in the feature space induced by a kernel. This is of central importance in some kernel applications, such as using kernel principal component analysis (PCA) for image denoising. Unlike the traditional method, which relies on nonlinear optimization, our proposed method directly finds the location of the pre-image based on distance constraints in the feature space. It is noniterative, involves only linear algebra and does not suffer from numerical instability or local minima. Evaluations on performing kernel PCA and kernel clustering on the USPS data set show much improved performance.
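The distance-constraint idea can be sketched for an RBF kernel, where feature-space distances convert to input-space distances in closed form and the pre-image then follows from a small linear least-squares problem. This is a simplified illustration of the approach, not the paper's full neighbor-selection procedure.

```python
import numpy as np

def rbf_preimage(k_vals, X_nb, sigma2):
    """Pre-image sketch for k(x, y) = exp(-||x - y||^2 / sigma2).

    k_vals : kernel values between the unknown pre-image and each of the
             n neighbors in X_nb (n, d).
    For the RBF kernel, ||x - x_i||^2 = -sigma2 * log k(x, x_i); pairwise
    differences of these constraints give a linear system A x = b.
    """
    d2 = -sigma2 * np.log(np.clip(k_vals, 1e-12, 1.0))  # input-space distances
    A = 2.0 * (X_nb[1:] - X_nb[0])
    b = (d2[0] - d2[1:]) + (X_nb[1:] ** 2).sum(axis=1) - (X_nb[0] ** 2).sum()
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# Toy check: with exact kernel values, the true point is recovered.
rng = np.random.default_rng(0)
X_nb = rng.standard_normal((5, 3))        # 5 neighbors in 3-D input space
x_true = rng.standard_normal(3)
sigma2 = 2.0
k_vals = np.exp(-((X_nb - x_true) ** 2).sum(axis=1) / sigma2)
print(rbf_preimage(k_vals, X_nb, sigma2), x_true)
```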
Kwok, J.T. & Tsang, I.W. 2003, 'Linear dependency between ε and the input noise in ε-support vector regression', IEEE Transactions on Neural Networks, vol. 14, no. 3, pp. 544-553.
View/Download from: Publisher's site
In using the ε-support vector regression (ε-SVR) algorithm, one has to decide a suitable value for the insensitivity parameter ε. Smola et al. considered its "optimal" choice by studying the statistical efficiency in a location parameter estimation problem. While they successfully predicted a linear scaling between the optimal ε and the noise in the data, their theoretically optimal value does not have a close match with its experimentally observed counterpart in the case of Gaussian noise. In this paper, we attempt to better explain their experimental results by studying the regression problem itself. Our resultant predicted choice of ε is much closer to the experimentally observed optimal value, while again demonstrating a linear trend with the input noise.