Dr Guandong Xu

Biography

Dr Guandong Xu is a senior lecturer in the Advanced Analytics Institute at the University of Technology Sydney. He received BSc and MSc degrees in Computer Science and Engineering from Zhejiang University, China, and a PhD in Computer Science from Victoria University. He subsequently held several positions, including Postdoctoral Research Fellow and Vice-Chancellor Postdoctoral Fellow in the Centre for Applied Informatics at Victoria University, Australia, and Research Assistant Professor in the Department of Computer Science at Aalborg University, Denmark. He was an Endeavour Postdoctoral Research Fellow at the University of Tokyo in 2008.

Professional

Guandong has more than 80 publications in the areas of Web data mining, recommender systems, the social Web, social network analysis, and applied informatics. He has authored three monographs and one edited book, and has edited five conference proceedings with Springer, Taylor & Francis, and IGI, along with dozens of journal and conference papers.

He has served on the editorial boards of, or as a guest editor for, several international journals, such as The Computer Journal, Journal of Systems and Software, World Wide Web Journal, and International Journal of Social Network Mining, and he is the Assistant Editor-in-Chief of World Wide Web Journal. He is also active in organizing and serving international conferences and workshops, e.g., ASONAM 2014 and BESC 2014.

Lecturer, A/DRsch Advanced Analytics Institute
Core Member, Advanced Analytics Institute
B.Sc(ZJU), M.Sc(ZJU), PhD
Member, Institute of Electrical and Electronics Engineers
Member, Association for Computing Machinery
 
Phone
+61 2 9514 3788
Room
CB21.GD.05A

Research Interests

  • Data mining, Machine learning
  • Web usage mining, Web community, Web personalization and Recommender System
  • Information retrieval and processing, Web search
  • Social network analysis, Social media mining, Social Analytics

Can supervise: Yes
Registered at Level 1

Database, Data Analytics, Text Analytics, Recommender Systems

Book Chapters

Xu, G., Gu, Y. & Yi, X. 2013, 'On Group Extraction and Fusion for Tag-Based Social Recommendation' in Guandong Xu and Lin Li (eds), Social Media Mining and Social Network Analysis: Emerging Research, IGI Global, Hershey, USA, pp. 211-223.
With the recent information explosion, social websites have become popular in many Web 2.0 applications where social annotation services allow users to annotate various resources with freely chosen words, i.e., tags, which can help users find preferred resources. However, obtaining the proper relationship among user, resource, and tag is still a challenge in social annotation-based recommendation research. In this chapter, the authors aim to utilize the affinity relationship between tags and resources and between tags and users to extract group information. The key idea is to obtain the implicit relationship groups among users, resources, and tags and then fuse them to generate recommendations. The authors experimentally demonstrate that their strategy outperforms the state-of-the-art algorithms that fail to consider the latent relationships among tagging data.
Zong, Y. & Xu, G. 2013, 'Clustering Algorithms for Tags' in Guandong Xu and Lin Li (eds), Social Media Mining and Social Network Analysis: Emerging Research, IGI Global, Hershey, USA, pp. 39-53.
With the development and application of social media, more and more user-generated content is created. Tag data, a typical kind of user-generated content, has attracted considerable interest from researchers. In general, tags are textual descriptions freely chosen by users to label digital data sources in social tagging systems. Poor retrieval performance remains a major problem of most social tagging systems, resulting from the ambiguity, redundancy, and weak semantics of tags. Clustering is a useful tool to improve information retrieval in such systems. In this chapter, the authors (1) review the background of state-of-the-art tag clustering and the tag data description, (2) present five kinds of tag similarity measurements proposed by researchers, and (3) propose a new clustering algorithm for tags based on local information derived from a kernel function. This chapter aims to benefit both academic and industry communities who are interested in the techniques and applications of tag clustering.
Li, L., Xiao, H. & Xu, G. 2013, 'Recommending Related Microblogs' in Guandong Xu and Lin Li (eds), Social Media Mining and Social Network Analysis: Emerging Research, IGI Global, Hershey, USA, pp. 202-210.
Computing similarity between short microblogs is an important step in microblog recommendation. In this chapter, the authors utilize three kinds of approaches (a traditional term-based approach, a WordNet-based semantic approach, and a topic-based approach) to compute similarities between microblogs and recommend the top related ones to users. They conduct an experimental study on the effectiveness of the three approaches in terms of precision. The results show that the WordNet-based semantic similarity approach has a relatively higher precision than the traditional term-based approach, and the topic-based approach performs worst, with 548 tweets as the dataset. In addition, the authors calculated the Kendall tau distance between the two lists generated by any two of the WordNet, term, and topic approaches. Its average over all 548 pairs of lists shows that the WordNet-based and term-based approaches generally agree closely in the ranking of related tweets, while the topic-based approach disagrees relatively strongly with the WordNet-based approach.
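The Kendall tau distance mentioned in the abstract above counts the item pairs that two rankings place in opposite order. The following is a minimal sketch of that standard measure, not the chapter's own code; the function names and the tweet identifiers in the usage note are hypothetical, and both rankings are assumed to contain the same items without duplicates.

```python
from itertools import combinations


def kendall_tau_distance(ranking_a, ranking_b):
    """Count the item pairs that the two rankings order differently."""
    if set(ranking_a) != set(ranking_b):
        raise ValueError("both rankings must contain the same items")
    # Position of every item in the second ranking.
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    discordant = 0
    # combinations() yields (x, y) with x ahead of y in ranking_a.
    for x, y in combinations(ranking_a, 2):
        if pos_b[x] > pos_b[y]:
            discordant += 1
    return discordant


def normalized_kendall_tau(ranking_a, ranking_b):
    """Scale the distance to [0, 1]; 0 is identical, 1 is fully reversed."""
    n = len(ranking_a)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 0.0
    return kendall_tau_distance(ranking_a, ranking_b) / pairs
```

For example, comparing the hypothetical rankings ["t1", "t2", "t3"] and ["t2", "t1", "t3"] gives a distance of 1, because only the pair (t1, t2) is ordered differently.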

Books

Xu, G. & Li, L. 2013, Social Media Mining and Social Network Analysis: Emerging Research, 1st, IGI Global, Hershey, USA.
Social Media Mining and Social Network Analysis: Emerging Research highlights the advancements made in social network analysis and social web mining and its influence in the fields of computer science, information systems, sociology, organization science discipline and much more. This collection of perspectives on developmental practice is useful for industrial practitioners as well as researchers and scholars.
Xu, G., Zong, Y. & Yang, Z. 2013, Applied Data Mining, 1st, CRC Press, USA.
Luo, T., Chen, S., Xu, G. & Zhou, J. 2013, Trust-based Collective View Prediction, 1st, Springer Berlin / Heidelberg, Germany.
Collective view prediction is to judge the opinions of an active web user based on unknown elements by referring to the collective mind of the whole community. Content-based recommendation and collaborative filtering are two mainstream collective view prediction techniques. They generate predictions by analyzing the text features of the target object or the similarity of users' past behaviors. Still, these techniques are vulnerable to the artificially-injected noise data, because they are not able to judge the reliability and credibility of the information sources. Trust-based Collective View Prediction describes new approaches for tackling this problem by utilizing users' trust relationships from the perspectives of fundamental theory, trust-based collective view prediction algorithms and real case studies. The book consists of two main parts: a theoretical foundation and an algorithmic study. The first part will review several basic concepts and methods related to collective view prediction, such as state-of-the-art recommender systems, sentimental analysis, collective view, trust management, the Relationship of Collective View and Trustworthy, and trust in collective view prediction. In the second part, the authors present their models and algorithms based on a quantitative analysis of more than 300 thousand users' data from popular product-reviewing websites. They also introduce two new trust-based prediction algorithms, one collaborative algorithm based on the second-order Markov random walk model, and one Bayesian fitting model for combining multiple predictors.
Xu, G., Zhang, Y. & Li, L. 2011, Web Mining and Social Networking - Techniques and Applications, 1st, Springer Berlin / Heidelberg, Germany.
This book examines the techniques and applications involved in the Web Mining, Web Personalization and Recommendation and Web Community Analysis domains, including a detailed presentation of the principles, developed algorithms, and systems of the research in these areas. The applications of web mining, and the issue of how to incorporate web mining into web personalization and recommendation systems are also reviewed. Additionally, the volume explores web community mining and analysis to find the structural, organizational and temporal developments of web communities and reveal the societal sense of individuals or communities. The volume will benefit both academic and industry communities interested in the techniques and applications of web search, web data management, web mining and web knowledge discovery, as well as web community and social network analysis.

Conference Papers

Hu, L., Cao, J., Xu, G., Cao, L., Gu, Z. & Zhu, C. 2013, 'Personalized recommendation via cross-domain triadic factorization', International World Wide Web Conference, Rio de Janeiro, Brazil, May 2013 in Proceedings of the 22nd international conference on World Wide Web WWW'13, ed Daniel Schwabe, Virgílio A. F. Almeida, Hartmut Glaser, Ricardo A. Baeza-Yates, Sue B. Moon, ACM, The United States, pp. 595-606.
Collaborative filtering (CF) is a major technique in recommender systems to help users find their potentially desired items. Since the data sparsity problem is commonly encountered in real-world scenarios, Cross-Domain Collaborative Filtering (CDCF) has become an emerging research topic in recent years. However, due to the lack of sufficiently dense explicit feedback, and the absence of any feedback in domains where users are not involved, current CDCF approaches may not perform satisfactorily in user preference prediction. In this paper, we propose a generalized Cross Domain Triadic Factorization (CDTF) model over the triadic relation user-item-domain, which can better capture the interactions between domain-specific user factors and item factors. In particular, we devise two CDTF algorithms to leverage users' explicit and implicit feedback respectively, along with a genetic-algorithm-based weight parameter tuning algorithm to trade off influence among domains optimally. Finally, we conduct experiments to evaluate our models and compare them with other state-of-the-art models by using two real-world datasets. The results show the superiority of our models over the other models compared.
Hu, L., Cao, J., Xu, G., Wang, J., Gu, Z. & Cao, L. 2013, 'Cross-Domain Collaborative Filtering via Bilinear Multilevel Analysis', International Joint Conference on Artificial Intelligence, Beijing, China, August 2013 in Proceedings of the 23rd International Joint Conference on Artificial Intelligence, ed Francesca Rossi, IJCAI/AAAI, The United States, pp. 2626-2632.
Cross-domain collaborative filtering (CDCF), which aims to leverage data from multiple domains to relieve the data sparsity issue, is becoming an emerging research topic in recent years. However, current CDCF methods that mainly consider user and item factors but largely neglect the heterogeneity of domains may lead to improper knowledge transfer issues. To address this problem, we propose a novel CDCF model, the Bilinear Multilevel Analysis (BLMA), which seamlessly introduces multilevel analysis theory to the most successful collaborative filtering method, matrix factorization (MF). Specifically, we employ BLMA to more efficiently address the determinants of ratings from a hierarchical view by jointly considering domain, community, and user effects so as to overcome the issues caused by traditional MF approaches. Moreover, a parallel Gibbs sampler is provided to learn these effects. Finally, experiments conducted on a real-world dataset demonstrate the superiority of the BLMA over other state-of-the-art methods.
Fu, B., Xu, G., Wang, Z. & Cao, L. 2013, 'Leveraging Supervised Label Dependency Propagation for Multi-label Learning', International Conference on Data Mining, Dallas, TX, USA, December 2013 in 2013 IEEE 13th International Conference on Data Mining, ed Hui Xiong, George Karypis, Bhavani M. Thuraisingham, Diane J. Cook, Xindong Wu, IEEE, The United States, pp. 1061-1066.
Exploiting label dependency is a key challenge in multi-label learning, and current methods solve this problem mainly by training models on the combination of related labels and original features. However, label dependency cannot be exploited dynamically and mutually in this way. Therefore, we propose a novel paradigm of leveraging label dependency in an iterative way. Specifically, each label's prediction will be updated and also propagated to other labels via a random walk with restart process. Meanwhile, the label propagation is implemented as a supervised learning procedure via optimizing a loss function, thus more appropriate label dependency can be learned. Extensive experiments are conducted, and the results demonstrate that our method can achieve considerable improvements in terms of several evaluation metrics.
Yi, X., Paulet, R., Bertino, E. & Xu, G. 2013, 'Private data warehouse queries', ACM Symposium on Access Control Models and Technologies, Amsterdam, June 2013 in Proceedings of the 18th ACM Symposium on Access Control Models and Technologies, ed Mauro Conti, Jaideep Vaidya, Andreas Schaad, ACM, US, pp. 25-36.
Publicly accessible data warehouses are an indispensable resource for data analysis. But they also pose a significant risk to the privacy of the clients, since a data warehouse operator may follow the client's queries and infer what the client is interested in. Private Information Retrieval (PIR) techniques allow the client to retrieve a cell from a data warehouse without revealing to the operator which cell is retrieved. However, PIR cannot be used to hide OLAP operations performed by the client, which may disclose the client's interest. This paper presents a solution for private data warehouse queries on the basis of the Boneh-Goh-Nissim cryptosystem which allows one to evaluate any multi-variate polynomial of total degree 2 on ciphertexts. By our solution, the client can perform OLAP operations on the data warehouse and retrieve one (or more) cell without revealing any information about which cell is selected. Furthermore, our solution supports some types of statistical analysis on data warehouse, such as regression and variance analysis, without revealing the client's interest. Our solution ensures both the server's security and the client's security.
Xu, G. & Wu, Z. 2012, 'On Smart and Accurate Contextual Advertising', Database Systems for Advanced Applications, Busan, South Korea, April 2012 in Lecture Notes in Computer Science, ed NA, Springer Berlin / Heidelberg, Germany, pp. 104-104.
Web advertising, which uses the World Wide Web to attract customers, has become one of the most important marketing channels. As one prevalent type of Web advertising, contextual advertising refers to the placement of the most relevant commercial ads into the content of a Web page, so as to increase the number of ad clicks. However, problems such as homonymy and polysemy, low intersection of keywords, and context mismatch can lead to the selection of irrelevant ads for a generic page, so that traditional keyword matching techniques generally show poor accuracy. Furthermore, existing contextual advertising techniques only consider how to select ads that are as relevant as possible for a generic page, without considering the positional effect of the ad placement in the page. In this paper, we propose a new contextual advertising framework to tackle these problems, which (1) uses Wikipedia concept and category information to enrich the semantic representation of a page (or a textual ad) and (2) takes the placement position of embedded ads into account. To accomplish these steps, we first map each page (or ad) into three feature vectors: a keyword vector, a concept vector and a category vector. Second, we determine the relevant ads for a given page based on a similarity measure which combines the above three feature vectors. In dealing with position-wise contextual advertising, the relevant ads are selected based on not only global context relevance but also local context relevance, so that the embedded ads yield contextual relevance to both the whole targeted page and the insertion positions where the ads are placed. We experimentally validate our approach by using a real ad set, a real page set, and a set of more than 260,000 concepts and 12,000 categories from Wikipedia. The experimental results show that our approach performs better than simple keyword matching and can improve the precision of ad selection effectively.
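The abstract above describes mapping each page or ad to a keyword vector, a concept vector and a category vector, and ranking ads by a similarity measure that combines the three. One way such a combination could look is a weighted sum of per-vector cosine similarities. This is an illustrative sketch only: the paper's actual measure is not given here, and the weights, the dictionary-based vector layout, and the function names are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(page, ad, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of keyword, concept and category similarities.

    The weights are hypothetical placeholders, not values from the paper.
    """
    views = ("keyword", "concept", "category")
    return sum(w * cosine(page[name], ad[name])
               for w, name in zip(weights, views))


def rank_ads(page, ads):
    """Return ad names sorted from most to least similar to the page."""
    return sorted(ads,
                  key=lambda name: combined_similarity(page, ads[name]),
                  reverse=True)
```

The concept and category vectors are what let such a measure separate ads that share an ambiguous keyword (the homonymy problem named in the abstract): two ads matching the same keyword can still differ sharply on concept and category overlap.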
Fu, B., Wang, Z., Pan, R., Xu, G. & Dolog, P. 2012, 'Learning Tree Structure of Label Dependency for Multi-label Learning', Pacific-Asia Conference on Knowledge Discovery and Data Mining, Kuala Lumpur, Malaysia, May 2012 in Advances in Knowledge Discovery and Data Mining Lecture Notes in Computer Science, ed NA, Springer Berlin / Heidelberg, Germany, pp. 159-170.
There always exists some kind of label dependency in multi-label data. Learning and utilizing those dependencies could improve the learning performance further. Therefore, an approach for multi-label learning is proposed in this paper, which first quantifies the dependencies of pairwise labels and then builds a tree structure of the labels to describe them. Thus the approach could find out potential strong label dependencies and produce more generalized dependent relationships. The experimental results have validated that, compared with other state-of-the-art algorithms, the method is not only a competitive alternative but also shows better performance, especially after ensemble learning.
Zong, Y., Xu, G., Jin, P., Zhang, Y., Chen, E. & Pan, R. 2011, 'APPECT: An Approximate Backbone-Based Clustering Algorithm for Tags', International Conference on Advanced Data Mining and Applications, Beijing, China, December 2011 in Advanced Data Mining and Applications, Lecture Notes in Computer Science, ed NA, Springer Berlin / Heidelberg, Germany, pp. 175-189.
In social annotation systems, users label digital resources by using tags which are freely chosen textual descriptions. Tags are used to index, annotate and retrieve resources as additional metadata of resources. Poor retrieval performance remains a major problem of most social tagging systems resulting from the severe difficulty of ambiguity, redundancy and less semantic nature of tags. Clustering is a useful tool to address the aforementioned difficulties. Most research on tag clustering directly applies traditional clustering algorithms such as K-means or Hierarchical Agglomerative Clustering to tagging data, which possess inherent drawbacks, such as sensitivity to initialization. In this paper, we instead make use of the approximate backbone of tag clustering results to find better tag clusters. In particular, we propose an APProximate backbonE-based Clustering algorithm for Tags (APPECT). The main steps of APPECT are: (1) we execute the K-means algorithm on a tag similarity matrix M times and collect a set of tag clustering results Z = {C_1, C_2, ..., C_M}; (2) we form the approximate backbone of Z by executing a greedy search; (3) we fix the approximate backbone as the initial tag clustering result and then assign the remaining tags to the corresponding clusters based on similarity. Experimental results on three real-world datasets, namely MedWorm, MovieLens and Dmoz, demonstrate the effectiveness and the superiority of the proposed method against the traditional approaches.
Zong, Y., Xu, G., Jin, P., Dolog, P. & Jiang, S. 2011, 'A Local Information Passing Clustering Algorithm for Tagging Systems', Database Systems for Advanced Applications, Hong Kong, China, April 2011 in Database Systems for Advanced Applications, Lecture Notes in Computer Science, ed NA, Springer Berlin / Heidelberg, Germany, pp. 333-343.
Under social tagging systems, a typical Web2.0 application, users label digital data sources by using tags which are freely chosen textual descriptions. Tags are used to index, annotate and retrieve resource as an additional metadata of resource. Poor retrieval performance remains a major problem of most social tagging systems resulting from the severe difficulty of ambiguity, redundancy and less semantic nature of tags. Clustering method is a useful tool to increase the ability of information retrieval in the aforementioned systems. In this paper, we propose a novel clustering algorithm named LIPC (Local Information Passing Clustering algorithm). The main steps of LIPC are: (1) we estimate a KNN neighbor directed graph G of tags and calculate the kernel density of each tag in its neighborhood; (2) we generate local information, local coverage and local kernel of each tag; (3) we pass the local information on G by I and O operators until they are converged and tag priory are generated; (4) we use tag priory to find out the clusters of tags. Experimental results on two real world datasets namely MedWorm and MovieLens demonstrate the efficiency and the superiority of the proposed method.
Xu, G., Zong, Y., Pan, R., Dolog, P. & Jin, P. 2011, 'On Kernel Information Propagation for Tag Clustering in Social Annotation Systems', International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, Kaiserslautern, Germany, September 2011 in Knowledge-Based and Intelligent Information and Engineering Systems Lecture Notes in Computer Science, ed NA, Springer Berlin / Heidelberg, Germany, pp. 505-514.
In social annotation systems, users label digital resources by using tags which are freely chosen textual descriptors. Tags are used to index, annotate and retrieve resources as additional metadata of resources. Poor retrieval performance remains a major challenge of most social annotation systems resulting from the severe problems of ambiguity, redundancy and less semantic nature of tags. Clustering is a useful approach to handle these problems in social annotation systems. In this paper, we propose a novel clustering algorithm named kernel information propagation for tag clustering. This approach makes use of the kernel density estimation of the KNN neighbor directed graph as a start to reveal the prestige rank of tags in tagging data. The random walk with restart algorithm is then employed to determine the center points of tag clusters. The main strength of the proposed approach is the capability of partitioning tags from the perspective of tag prestige rank rather than the intuitive similarity calculation itself. Experimental studies on three real-world datasets demonstrate the effectiveness and superiority of the proposed method.
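The random walk with restart that the abstract above mentions can be sketched as a simple power iteration over a weighted directed graph. This shows the generic technique only, not the paper's algorithm: the graph layout (a dict of dicts in which every node appears as a key), the restart probability of 0.15, and the handling of nodes without outgoing edges are all assumptions.

```python
def random_walk_with_restart(adjacency, seed, restart=0.15,
                             tol=1e-10, max_iter=1000):
    """Return stationary visit probabilities for a walk that jumps back
    to `seed` with probability `restart` at every step.

    adjacency maps each node to a dict {neighbor: edge weight}; every
    neighbor must also be a key of adjacency (assumed layout).
    """
    nodes = list(adjacency)
    # Start with all probability mass on the seed node.
    scores = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(max_iter):
        nxt = {n: (restart if n == seed else 0.0) for n in nodes}
        for n in nodes:
            out = adjacency[n]
            total = sum(out.values())
            if total == 0:
                # Node without outgoing edges: return its mass to the seed.
                nxt[seed] += (1 - restart) * scores[n]
                continue
            for m, w in out.items():
                nxt[m] += (1 - restart) * scores[n] * w / total
        delta = sum(abs(nxt[n] - scores[n]) for n in nodes)
        scores = nxt
        if delta < tol:
            break
    return scores
```

Nodes that end up with high scores are those the walker reaches often from the seed, which is the sense in which such a walk can rank nodes by proximity or prestige.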
Xu, G., Gu, Y., Zhang, Y., Yang, Z. & Kitsuregawa, M. 2011, 'TOAST: A Topic-Oriented Tag-Based Recommender System', International Conference on Web Information Systems Engineering, Sydney, Australia, October 2011 in Web Information System Engineering - WISE 2011 Lecture Notes in Computer Science, ed NA, Springer Berlin / Heidelberg, Germany, pp. 158-171.
Social Annotation Systems have emerged as a popular application with the advance of Web 2.0 technologies. Tags generated by users using arbitrary words to express their own opinions and perceptions on various resources provide a new intermediate dimension between users and resources, which is deemed to convey user preference information. Using clustering for topic extraction and incorporating it with the capture of user preference and resource affiliation is becoming an effective practice in tag-based recommender systems. In this paper, we aim to address these challenges via a topic graph approach. We first propose a Topic Oriented Graph (TOG), which models the user preference and resource affiliation on various topics. Based on the graph, we devise a Topic-Oriented Tag-based Recommendation System (TOAST) by using preference propagation on the graph. We conduct experiments on two real datasets to demonstrate that our approach outperforms other state-of-the-art algorithms.
Xu, G., Zong, Y., Dolog, P. & Zhang, Y. 2010, 'Co-clustering Analysis of Weblogs Using Bipartite Spectral Projection Approach', International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, Cardiff, UK, September 2010 in Knowledge-Based and Intelligent Information and Engineering Systems Lecture Notes in Computer Science, ed NA, Springer Berlin / Heidelberg, Germany, pp. 398-407.
Web clustering is an approach for aggregating Web objects into various groups according to underlying relationships among them. Finding co-clusters of Web objects is an interesting topic in the context of Web usage mining, which is able to capture the underlying user navigational interest and content preference simultaneously. In this paper we will present an algorithm using bipartite spectral clustering to cocluster Web users and pages. The usage data of users visiting Web sites is modeled as a bipartite graph and the spectral clustering is then applied to the graph representation of usage data. The proposed approach is evaluated by experiments performed on real datasets, and the impact of using various clustering algorithms is also investigated. Experimental results have demonstrated the employed method can effectively reveal the subset aggregates of Web users and pages which are closely related.
Zong, Y., Xu, G., Dolog, P., Zhang, Y. & Liu, R. 2010, 'Co-clustering for Weblogs in Semantic Space', International Conference on Web Information Systems Engineering, Hong Kong, China, December 2010 in Web Information Systems Engineering - WISE 2010 Lecture Notes in Computer Science, ed NA, Springer Berlin / Heidelberg, Germany, pp. 120-127.
Web clustering is an approach for aggregating web objects into various groups according to underlying relationships among them. Finding co-clusters of web objects in semantic space is an interesting topic in the context of web usage mining, which is able to capture the underlying user navigational interest and content preference simultaneously. In this paper we will present a novel web co-clustering algorithm named Co-Clustering in Semantic space (COCS) to simultaneously partition web users and pages via a latent semantic analysis approach. In COCS, we first train the latent semantic space of weblog data by using the Probabilistic Latent Semantic Analysis (PLSA) model, then project all weblog data objects into this semantic space with probability distribution to capture the relationship among web pages and web users, and finally propose a clustering algorithm to generate the co-cluster corresponding to each semantic factor in the latent semantic space via probability inference. The proposed approach is evaluated by experiments performed on real datasets in terms of precision and recall metrics. Experimental results have demonstrated the proposed method can effectively reveal the co-aggregates of web users and pages which are closely related.
Li, L., Xu, G., Zhang, Y. & Kitsuregawa, M. 2009, 'Enhancing Web Search by Aggregating Results of Related Web Queries', International Conference on Web Information Systems Engineering, Poznan, Poland, October 2009 in Web Information Systems Engineering - WISE 2009 Lecture Notes in Computer Science, ed NA, Springer Berlin / Heidelberg, Germany, pp. 203-217.
Currently, commercial search engines have implemented methods to suggest alternative Web queries to users, which helps them specify alternative related queries in pursuit of finding needed Web pages. In this paper, we address the Web search problem on related queries to improve retrieval quality by devising a novel search rank aggregation mechanism. Given an initial query and the suggested related queries, our search system concurrently processes their search result lists from an existing search engine and then forms a single list aggregated by all the retrieved lists. In particular we propose a generic rank aggregation framework which considers not only the number of wins that an item won in a competition, but also the quality of its competitor items in calculating the ranking of Web items. The framework combines the traditional and random walk based rank aggregation methods to produce a more reasonable list to users. Experimental results show that the proposed approach can clearly improve the retrieval quality in a parallel manner over the traditional search strategy that serially returns result lists. Moreover, we also empirically investigate how different rank aggregation methods affect the retrieval performance.
Thongkam, J., Xu, G., Zhang, Y. & Huang, F. 2008, 'Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction', Asia Pacific Web Conference, Shenyang, China, April 2008 in Advanced Web and Network Technologies, and Applications Lecture Notes in Computer Science, ed NA, Springer Berlin / Heidelberg, Germany, pp. 99-109.
Finding and removing misclassified instances are important steps in data mining and machine learning that affect the performance of the data mining algorithm in general. In this paper, we propose a C-Support Vector Classification Filter (C-SVCF) to identify and remove the misclassified instances (outliers) in breast cancer survivability samples collected from Srinagarind hospital in Thailand, to improve the accuracy of the prediction models. Only instances that are correctly classified by the filter are passed to the learning algorithm. Performance of the proposed technique is measured with accuracy and area under the receiver operating characteristic curve (AUC), as well as compared with several popular ensemble filter approaches including AdaBoost, Bagging and ensemble of SVM with AdaBoost and Bagging filters. Our empirical results indicate that C-SVCF is an effective method for identifying misclassified outliers. This approach significantly benefits ongoing research of developing accurate and robust prediction models for breast cancer survivability.

Journal Articles

Chen, X., Liu, L., Luo, D., Xu, G., Lu, Y., Liu, M. & Gao, R. 2014, 'A Spectral Clustering Algorithm Based on Hierarchical Method', Lecture Notes in Computer Science, vol. 8316, no. 1, pp. 111-123.
Most of the clustering algorithms were designed to cluster the data in convex spherical sample space, but their ability was poor for clustering more complex structures. In the past few years, several spectral clustering algorithms were proposed to cluster arbitrarily shaped data in various real applications including image processing and web analysis. However, most of these algorithms were based on k-means, which is a randomized algorithm and makes the algorithm easy to fall into local optimal solutions. Hierarchical method could handle the local optimum well because it organizes data into different groups at different levels. In this paper, we propose a novel clustering algorithm called spectral clustering algorithm based on hierarchical clustering (SCHC), which combines the advantages of hierarchical clustering and spectral clustering algorithms to avoid the local optimum issues. The experiments on both synthetic data sets and real data sets show that SCHC outperforms other six popular clustering algorithms. The method is simple but is shown to be efficient in clustering both convex shaped data and arbitrarily shaped data.
Fu, B., Wang, Z., Xu, G. & Cao, L. 2014, 'Multi-label learning based on iterative label propagation over graph', Pattern Recognition Letters, vol. 42, pp. 85-90.
Xu, G. 2014, 'Expanding user's query with tag-neighbors for effective medical information retrieval', Multimedia Tools and Applications, vol. 71, no. 2, pp. 905-929.
You, Y., Xu, G., Cao, J., Zhang, Y. & Huang, G. 2013, 'Leveraging visual features and hierarchical dependencies for conference information extraction', Lecture Notes in Computer Science, vol. 7808, no. 1, pp. 404-416.
Traditional information extraction methods mainly rely on visual feature assisted techniques; but without considering the hierarchical dependencies within the paragraph structure, some important information is missing. This paper proposes an integrated a
Wu, Z., Xu, G., Lu, C., Chen, E.X., Zhang, Y. & Zhang, H. 2013, 'Position-wise contextual advertising: Placing relevant ads at appropriate positions of a web page', Neurocomputing, vol. 120, no. 1, pp. 524-535.
Web advertising, a form of online advertising, which uses the Internet as a medium to post product or service information and attract customers, has become one of the most important marketing channels. As one prevalent type of web advertising, contextual
Xu, G., Yu, J.X. & Lee, W. 2013, 'Social networks and social Web mining', World Wide Web-Internet And Web Information Systems, vol. 16, no. 5-6, pp. 541-544.
Li, F., Xu, G., Cao, L., Fan, X. & Niu, Z. 2013, 'CGMF: Coupled Group-Based Matrix Factorization for Recommender System', Lecture Notes in Computer Science, vol. 8180, no. 1, pp. 289-298.
With the advent of social influence, social recommender systems have become an active research topic for making recommendations based on the ratings of the users that have close social relations with the given user. The underlying assumption is that a user's taste is similar to his/her friends' in social networking. In fact, users enjoy different groups of items with different preferences. A user may be treated as trustful by his/her friends more on some specific rather than all groups. Unfortunately, most of the extant social recommender systems are not able to differentiate user's social influence in different groups, resulting in the unsatisfactory recommendation results. Moreover, most extant systems mainly rely on social relations, but overlook the influence of relations between items. In this paper, we propose an innovative coupled group-based matrix factorization model for recommender system by leveraging the user and item groups learned by topic modeling and incorporating couplings between users and items and within users and items. Experiments conducted on publicly available data sets demonstrate the effectiveness of our approach.
Li, L., Xu, G., Yang, Z., Dolog, P., Zhang, Y. & Kitsuregawa, M. 2013, 'An efficient approach to suggesting topically related web queries using hidden topic model', World Wide Web, vol. 16, no. 3, pp. 273-297.
Keyword-based Web search is a widely used approach for locating information on the Web. However, Web users usually suffer from the difficulties of organizing and formulating appropriate input queries due to the lack of sufficient domain knowledge, which greatly affects the search performance. An effective tool to meet the information needs of a search engine user is to suggest Web queries that are topically related to their initial inquiry. Accurately computing query-to-query similarity scores is a key to improve the quality of these suggestions. Because of the short lengths of queries, traditional pseudo-relevance or implicit-relevance based approaches expand the expression of the queries for the similarity computation. They explicitly use a search engine as a complementary source and directly extract additional features (such as terms or URLs) from the top-listed or clicked search results. In this paper, we propose a novel approach by utilizing the hidden topic as an expandable feature. This has two steps. In the offline model-learning step, a hidden topic model is trained, and for each candidate query, its posterior distribution over the hidden topic space is determined to re-express the query instead of the lexical expression. In the online query suggestion step, after inferring the topic distribution for an input query in a similar way, we then calculate the similarity between candidate queries and the input query in terms of their corresponding topic distributions; and produce a suggestion list of candidate queries based on the similarity scores. Our experimental results on two real data sets show that the hidden topic based suggestion is much more efficient than the traditional term or URL based approach, and is effective in finding topically related queries for suggestion.
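The online suggestion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the topic distributions have already been inferred offline, and it assumes cosine similarity as the comparison measure (the abstract does not name one). The candidate queries and their distributions are made up for the example.

```python
import math

def cosine(p, q):
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def suggest(input_topics, candidates, top_n=3):
    """Rank candidate queries by similarity of their topic distributions
    to the input query's distribution and return the top_n query strings."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine(input_topics, item[1]),
        reverse=True,
    )
    return [query for query, _ in ranked[:top_n]]

# Hypothetical candidates with made-up distributions over three hidden topics.
candidates = {
    "cheap flights": [0.8, 0.1, 0.1],
    "python tutorial": [0.1, 0.8, 0.1],
    "hotel deals": [0.7, 0.1, 0.2],
}
print(suggest([0.9, 0.05, 0.05], candidates, top_n=2))
```

Because each query is reduced to a short fixed-length vector, the online step needs no calls to a search engine, which is consistent with the efficiency claim in the abstract.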
Li, X., Zhang, L., Chen, E., Zong, Y. & Xu, G. 2013, 'Mining Frequent Patterns in Print Logs with Semantically Alternative Labels', Lecture Notes in Computer Science, vol. 8347, pp. 107-119.
It is common today for users to print the informative information from webpages due to the popularity of printers and internet. Thus, many web printing tools such as Smart Print and PrintUI are developed for online printing. In order to improve the users' printing experience, the interaction data between users and these tools are collected to form a so-called print log data, where each record is the set of urls selected for printing by a user within a certain period of time. Apparently, mining frequent patterns from these print log data can capture user intentions for other applications, such as printing recommendation and behavior targeting. However, mining frequent patterns by directly using url as item representation in print log data faces two challenges: data sparsity and pattern interpretability. To tackle these challenges, we attempt to leverage delicious api (a social bookmarking web service) as an external thesaurus to expand the semantics of each url by selecting tags associated with the domain of each url. In this setting, the frequent pattern mining is employed on the tag representation of each url rather than the url or domain representation. With the enhancement of semantically alternative tag representation, the semantics of url is substantially improved, thus yielding the useful frequent patterns. To this end, in this paper we propose a novel pattern mining problem, namely mining frequent patterns with semantically alternative labels, and propose an efficient algorithm named PaSAL (Frequent Patterns with Semantically Alternative Labels Mining Algorithm) for this problem. Specifically, we propose a new constraint named conflict matrix to purify the redundant patterns to achieve a high efficiency. Finally, we evaluate the proposed algorithm on a real print log data.
Wu, L., Chin, A., Xu, G., Du, L., Wang, X., Meng, K., Guo, Y. & Zhou, Y. 2013, 'Who Will Follow Your Shop? Exploiting Multiple Information Sources in Finding Followers', Lecture Notes in Computer Science, vol. 7826, no. 1, pp. 401-415.
WuXianGouXiang is an O2O(offline to online and vice versa)-based mobile application that recommends the nearby coupons and deals for users, by which users can also follow the shops they are interested in. If the potential followers of a shop can be discovered, the merchant's targeted advertising can be more effective and the recommendations for users will also be improved. In this paper, we propose to predict the link relations between users and shops based on the following behavior. In order to better model the characteristics of the shops, we first adopt Topic Modeling to analyze the semantics of their descriptions and then propose a novel approach, named INtent Induced Topic Search (INITS) to update the hidden topics of the shops with and without a description. In addition, we leverage the user logs and search engine results to get the similarity between users and shops. Then we adopt the latent factor model to calculate the similarity between users and shops, in which we use the multiple information sources to regularize the factorization. The experimental results demonstrate that the proposed approach is effective for detecting followers of the shops and the INITS model is useful for shop topic inference.
Cuzzocrea, A., Moussa, R. & Xu, G. 2013, 'OLAP*: Effectively and Efficiently Supporting Parallel OLAP over Big Data', Lecture Notes in Computer Science, vol. 8216, no. 1, pp. 38-49.
In this paper, we investigate solutions relying on data partitioning schemes for parallel building of OLAP data cubes, suitable to novel Big Data environments, and we propose the framework OLAP*, along with the associated benchmark TPC-H*d, a suitable transformation of the well-known data warehouse benchmark TPC-H. We demonstrate through performance measurements the efficiency of the proposed framework, developed on top of the ROLAP server Mondrian
Wang, Z., Luo, T., Xu, G. & Wang, X. 2013, 'A New Indexing Technique for Supporting By-attribute Membership Query of Multidimensional Data', Lecture Notes in Computer Science, vol. 7901, no. 1, pp. 266-277.
Multidimensional data indexing and lookup has been widely used in online data-intensive applications involving data with multiple attributes. However, there remains a long way to go for high-performance multi-attribute data representation and lookup: the performance of an index drops as the number of dimensions increases. In this paper, we present a novel data structure called Bloom Filter Matrix (BFM) to support multidimensional data indexing and by-attribute search. The proposed matrix is based on the Cartesian product of different Bloom filters, each representing one attribute of the original data. The structure and parameters of each Bloom filter are designed to fit the actual data characteristics and system demand, enabling fast object indexing and lookup, especially by-attribute search of multidimensional data. Experiments show that Bloom Filter Matrix is a fast and accurate data structure for multi-attribute data indexing and by-attribute search with high-correlated queries.
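One way to read the "Cartesian product of Bloom filters" idea is sketched below for two attributes. This is an interpretation based only on the abstract, not the published BFM design: the class name, the hash construction, and the by-attribute check are all assumptions made for illustration.

```python
import hashlib
from itertools import product

def _positions(value, size, num_hashes):
    """Derive num_hashes bit positions in [0, size) for a value (hypothetical scheme)."""
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{value}".encode()).digest()[:8], "big") % size
        for seed in range(num_hashes)
    ]

class BloomFilterMatrix2D:
    """Two-attribute sketch: rows index the first attribute's filter,
    columns index the second attribute's filter."""

    def __init__(self, rows=64, cols=64, num_hashes=3):
        self.rows, self.cols, self.k = rows, cols, num_hashes
        self.bits = [[False] * cols for _ in range(rows)]

    def _cells(self, a, b):
        # Cartesian product of the two attributes' hash positions.
        return product(_positions(a, self.rows, self.k),
                       _positions(b, self.cols, self.k))

    def add(self, a, b):
        for r, c in self._cells(a, b):
            self.bits[r][c] = True

    def might_contain(self, a, b):
        """Full lookup: False means definitely absent, True means possibly present."""
        return all(self.bits[r][c] for r, c in self._cells(a, b))

    def might_contain_first(self, a):
        """By-attribute lookup: is there possibly some item whose first attribute is a?"""
        return all(any(self.bits[r]) for r in _positions(a, self.rows, self.k))

bfm = BloomFilterMatrix2D()
bfm.add("alice", "sydney")
print(bfm.might_contain("alice", "sydney"), bfm.might_contain_first("alice"))
```

As with an ordinary Bloom filter, this sketch never reports a stored item as absent but can report false positives; sizing each dimension to the data is what the abstract refers to as fitting the filter parameters to the data characteristics.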
Wu, Z., Yin, W., Cao, J., Xu, G. & Cuzzocrea, A. 2013, 'Community Detection in Multi-relational Social Networks', Lecture Notes in Computer Science, vol. 8181, no. 1, pp. 43-56.
Multi-relational networks are ubiquitous in many fields such as bibliography, Twitter, and healthcare. There have been many studies in the literature aimed at discovering communities from social networks. However, most of them have focused on single-relational networks. A handful of methods detect communities from multi-relational networks by converting them to single-relational networks first. Nevertheless, they commonly assume that different relations are independent of each other, which is unrealistic in real-life cases. In this paper, we attempt to address this challenge by introducing a novel co-ranking framework, named MutuRank. It makes full use of the mutual influence between relations and actors to transform the multi-relational network to the single-relational network. We then present GMM-NK (Gaussian Mixture Model with Neighbor Knowledge) based on local consistency principle to enhance the performance of spectral clustering process in discovering overlapping communities. Experimental results on both synthetic and real-world data demonstrate the effectiveness of the proposed method.
Liu, L., Chen, X., Luo, D., Lu, Y., Xu, G. & Liu, M. 2013, 'HSC: A spectral clustering algorithm combined with hierarchical method', Neural Network World, vol. 23, no. 6, pp. 499-521.
Most of the traditional clustering algorithms are poor for clustering more complex structures other than the convex spherical sample space. In the past few years, several spectral clustering algorithms were proposed to cluster arbitrarily shaped data in various real applications. However, spectral clustering relies on the dataset where each cluster is approximately well separated to a certain extent. In the case that the cluster has an obvious inflection point within a non-convex space, the spectral clustering algorithm would mistakenly recognize one cluster to be different clusters. In this paper, we propose a novel spectral clustering algorithm called HSC combined with hierarchical method, which obviates the disadvantage of the spectral clustering by not using the misleading information of the noisy neighboring data points. The simple clustering procedure is applied to eliminate the misleading information, and thus the HSC algorithm could cluster both convex shaped data and arbitrarily shaped data more efficiently and accurately. The experiments on both synthetic data sets and real data sets show that HSC outperforms other popular clustering algorithms. Furthermore, we observed that HSC can also be used for the estimation of the number of clusters
Wu, Z., Xu, G., Zhang, Y., Dolog, P. & Lu, C. 2012, 'An Improved Contextual Advertising Matching Approach Based On Wikipedia Knowledge', Computer Journal, vol. 55, no. 3, pp. 277-292.
The current boom of the Web is associated with the revenues originated from Web advertising. As one prevalent type of Web advertising, contextual advertising refers to the placement of the most relevant commercial textual ads within the content of a Web
Liu, L., Fan, D., Liu, M. & Xu, G. 2012, 'A MapReduce-Based Parallel Clustering Algorithm for Large Protein-Protein Interaction Networks', Lecture Notes in Computer Science, vol. 0302-9743, pp. 138-148.
Clustering proteins or identifying functionally related proteins in Protein-Protein Interaction (PPI) networks is one of the most computation-intensive problems in the proteomic community. Most research has focused on improving the accuracy of the clustering algorithms. However, the high computation cost of these clustering algorithms, such as Girvan and Newman's clustering algorithm, has been an obstacle to their use on large-scale PPI networks. In this paper, we propose an algorithm, called Clustering-MR, to address the problem. Our solution can effectively parallelize Girvan and Newman's clustering algorithm based on edge-betweenness using MapReduce. We evaluated the performance of our Clustering-MR algorithm in a cloud environment with different sizes of testing datasets and different numbers of worker nodes. The experimental results show that our Clustering-MR algorithm can achieve high performance for large-scale PPI networks with more than 1000 proteins or 5000 interactions.
Wu, Z., Xu, G., Zhang, Y., Cao, Z., Li, G. & Hu, Z. 2012, 'GMQL: A graphical multimedia query language', Knowledge-based Systems, vol. 26, pp. 135-143.
The rapid increase of multimedia data makes multimedia query more and more important. To better satisfy users' query requirements, developing a functional multimedia query language is becoming a promising and interesting task. In this paper, we propose a graphical multimedia query language called GMQL, which is developed based on a semi-structured data organization model. In GMQL, we combine the advantages of graphs and texts, making the query language much clearer, easier to use and more expressive. In this paper, we first present the notations and basic capabilities of GMQL by query examples. Second, we discuss the GMQL query processing techniques. Last, we evaluate and analyze our multimedia query language through comparison with other existing multimedia query languages. The evaluation results show that GMQL has powerful expressiveness, and thus is highly applicable for multimedia information retrieval.
Li, L., Zhong, L., Xu, G. & Kitsuregawa, M. 2012, 'A feature-free search query classification approach using semantic distance', Expert Systems with Applications, vol. 39, no. 12, pp. 10739-10748.
When classifying search queries into a set of target categories, machine learning based conventional approaches usually make use of external sources of information to obtain additional features for search queries and training data for target categories. Unfortunately, these approaches rely on large amounts of training data for high classification precision. Moreover, they are known to suffer from an inability to adapt to different target categories, which may be caused by the dynamic changes observed in both Web topic taxonomy and Web content. In this paper, we propose a feature-free classification approach using semantic distance. We analyze queries and categories themselves and utilize the number of Web pages containing both a query and a category as a semantic distance to determine their similarity. The most attractive feature of our approach is that it only utilizes the Web page counts estimated by a search engine to provide the search query classification with respectable accuracy. In addition, it adapts easily to changes in the target categories, whereas machine learning based approaches require an extensive updating process, e.g., re-labeling outdated training data and re-training classifiers, which is time-consuming and costly. We conduct an experimental study on the effectiveness of our approach using a set of rank measures and show that our approach performs competitively with some popular state-of-the-art solutions which, however, frequently use external sources and are inherently insufficient in flexibility.
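A page-count-based distance of this kind can be sketched as follows. The abstract does not give its formula, so the sketch swaps in the Normalized Google Distance as a stand-in; the page counts, category names, and index size are invented for illustration, and in practice the counts would come from a search engine's hit estimates.

```python
import math

def normalized_distance(count_x, count_y, count_xy, total_pages):
    """Normalized Google Distance (used here as an assumed stand-in).
    count_x, count_y: pages containing each term; count_xy: pages containing both."""
    if count_xy == 0:
        return float("inf")  # the terms never co-occur
    log_x, log_y = math.log(count_x), math.log(count_y)
    return (max(log_x, log_y) - math.log(count_xy)) / (
        math.log(total_pages) - min(log_x, log_y)
    )

def classify(query_count, categories, total_pages):
    """categories maps a category name to (category_count, joint_count_with_query).
    Returns the category with the smallest distance to the query."""
    return min(
        categories,
        key=lambda name: normalized_distance(
            query_count, categories[name][0], categories[name][1], total_pages
        ),
    )

# Made-up counts for one query against two hypothetical target categories.
categories = {"sports": (50_000, 800), "finance": (40_000, 5)}
print(classify(1_000, categories, total_pages=10**9))
```

Since the only inputs are page counts, changing the set of target categories requires new counts rather than re-labeling or re-training, which matches the adaptability argument in the abstract.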
Wu, Z., Xu, G., Zong, Y., Yi, X., Chen, E. & Zhang, Y. 2012, 'Executing SQL queries over encrypted character strings in the Database-As-Service model', Knowledge-based Systems, vol. 35, pp. 332-348.
Zong, Y., Xu, G., Jin, P., Zhang, Y. & Chen, E. 2011, 'HC_AB: A new heuristic clustering algorithm based on Approximate Backbone', Information Processing Letters, vol. 111, no. 17, pp. 857-863.
Clustering is an important research area with numerous applications in pattern recognition, machine learning, and data mining. Since the clustering problem on numeric data sets can be formulated as a typical combinatorial optimization problem, many researches have addressed the design of heuristic algorithms for finding sub-optimal solutions in a reasonable period of time. However, most of the heuristic clustering algorithms suffer from the problem of being sensitive to the initialization and do not guarantee the high quality results. Recently, Approximate Backbone (AB), i.e., the commonly shared intersection of several sub-optimal solutions, has been proposed to address the sensitivity problem of initialization. In this paper, we aim to introduce the AB into heuristic clustering to overcome the initialization sensitivity of conventional heuristic clustering algorithms. The main advantage of the proposed method is the capability of restricting the initial search space around the optimal result by defining the AB, and in turn, reducing the impact of initialization on clustering, eventually improving the performance of heuristic clustering. Experiments on synthetic and real world data sets are performed to validate the effectiveness of the proposed approach in comparison to three conventional heuristic clustering algorithms and three other algorithms with improvement on initialization
Li, L., Xu, G., Zhang, Y. & Kitsuregawa, M. 2011, 'Random walk based rank aggregation to improving web search', Knowledge-based Systems, vol. 24, no. 7, pp. 943-951.
In Web search, with the aid of related query recommendation, Web users can revise their initial queries in several serial rounds in pursuit of finding needed Web pages. In this paper, we address the Web search problem of aggregating search results of related queries to improve the retrieval quality. Given an initial query and the suggested related queries, our search system concurrently processes their search result lists from an existing search engine and then forms a single list aggregated from all the retrieved lists. We specifically propose a generic rank aggregation framework which consists of three steps. First we build a so-called Win/Loss graph of Web pages according to a competition rule, and then apply the random walk mechanism on the Win/Loss graph. Last we sort these Web pages by their ranks using a PageRank-like rank mechanism. The proposed framework considers not only the number of wins that an item won in competitions, but also the quality of its competitor items in calculating the ranking of Web page items. Experimental results show that our search system can clearly improve the retrieval quality in a parallel manner over the traditional search strategy that serially returns result lists. Moreover, we also provide empirical evidence to demonstrate how different rank aggregation methods affect the retrieval quality.
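The three steps can be sketched as below. The abstract does not state the competition rule, so this sketch assumes a simple one: within each result list, a page "wins" against every page ranked below it, and each loss becomes a weighted edge from loser to winner. The random walk is a plain damped PageRank iteration; the function name and parameters are hypothetical.

```python
from itertools import combinations

def aggregate(rank_lists, damping=0.85, iterations=50):
    """Merge several ranked lists into one via a random walk on a Win/Loss graph."""
    pages = sorted({page for lst in rank_lists for page in lst})
    # Step 1: Win/Loss graph. losses[loser][winner] counts how often winner outranked loser.
    losses = {page: {} for page in pages}
    for lst in rank_lists:
        for winner, loser in combinations(lst, 2):  # winner appears earlier in the list
            losses[loser][winner] = losses[loser].get(winner, 0) + 1
    # Step 2: damped random walk that moves from losers toward the pages that beat them.
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        nxt = {page: (1.0 - damping) / n for page in pages}
        for loser, winners in losses.items():
            total = sum(winners.values())
            if total == 0:
                # A page that never lost has no outgoing edges; spread its mass evenly.
                for page in pages:
                    nxt[page] += damping * score[loser] / n
            else:
                for winner, count in winners.items():
                    nxt[winner] += damping * score[loser] * count / total
        score = nxt
    # Step 3: sort pages by their stationary score, highest first.
    return sorted(pages, key=lambda page: -score[page])

print(aggregate([["a", "b", "c"], ["a", "b", "c"]]))
```

Because a page's score depends on the scores of the pages it defeated, a win over a strong competitor counts for more than a win over a weak one, which is the property the abstract highlights.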
Xu, G., Li, L., Zhang, Y., Yi, X. & Kitsuregawa, M. 2011, 'Modeling user hidden navigational behavior for Web recommendation', Web Intelligence and Agent Systems-An international journal, vol. 9, no. 3, pp. 239-255.
Web users exhibit a variety of navigational interests through clicking a sequence of Web pages. Analyses of Web usage data will lead to discovering Web user access patterns, and in turn, facilitating users to locate more preferable Web contents via collaborative recommendation techniques. In the context of Web usage mining, Latent Semantic Analysis (LSA) based on probability inference provides a promising approach to capture not only user hidden navigational patterns, but also the associations between users, pages and hidden navigational patterns. The discovered user access patterns could be used as a usage reference base for identifying the new user's access preferences and making usage-based collaborative Web recommendations. In this paper, we propose a novel usage-based Web recommendation framework, in which Latent Dirichlet Allocation (LDA) is employed to learn the underlying task space from the training Web log data and infer the task distribution for a target user via task inference. The main advantages of the adapted LDA model are its capabilities of efficiently learning the semantic usage information from the Web log data and effectively inferring the access preference of the target user even with a few Web clicks that might be unseen in the training data. In this paper, we also investigate the determination of an optimizing task number, which is another important problem commonly encountered in latent semantic analysis. Experiments conducted on a real Web log dataset show that this approach can achieve better recommendation performance in comparison to other existing techniques. And the discovered task-simplex expression can also provide a better interpretation for Web designers or users to better understand the user navigational preference.
Zong, Y., Xu, G., Zhang, Y., Jiang, H. & Li, M. 2010, 'A Robust Iterative Refinement Clustering Algorithm With Smoothing Search Space', Knowledge-based Systems, vol. 23, no. 5, pp. 389-396.
Iterative refinement clustering algorithms are widely used in the data mining area, but they are sensitive to the initialization. In the past decades, many modified initialization methods have been proposed to reduce the influence of the initialization sensitivity problem. The essence of iterative refinement clustering algorithms is the local search method. The large number of local minimum points embedded in the search space makes the local search problem hard and sensitive to the initialization. The fewer the local minimum points, the more robust a local search algorithm is to its initialization. In this paper, we propose a Top-Down Clustering algorithm with Smoothing Search Space (TDCS3) to reduce the influence of initialization. The main steps of TDCS3 are: (1) dynamically reconstruct a series of smoothed search spaces into a hierarchical structure by 'filling' the local minimum points; (2) at the top level of the hierarchical structure, run an existing iterative refinement clustering algorithm with random initialization to generate the clustering result; (3) from the second level to the bottom level of the hierarchical structure, run the same clustering algorithm with the initialization derived from the previous clustering result. Experiment results on 3 synthetic and 10 real world data sets have shown that TDCS3 has significant effects on finding better, robust clustering results and reducing the impact of initialization.
Zhang, Y. & Xu, G. 2009, 'On Web Communities Mining And Recommendation', Concurrency And Computation-practice & Experience, vol. 21, no. 5, pp. 561-582.
Because of the lack of a uniform schema for web documents and the sheer amount and dynamics of web data, both the effectiveness and the efficiency of information management and retrieval of web data are often unsatisfactory when using conventional data m
Thongkam, J., Xu, G., Zhang, Y. & Huang, F. 2009, 'Toward breast cancer survivability prediction models through improving training space', Expert Systems with Applications, vol. 36, no. 10, pp. 12200-12209.
Due to the difficulties of outlier and skewed data, the prediction of breast cancer survivability has presented many challenges in the field of data mining and pattern recognition, especially in medical research. To solve these problems, we have proposed a hybrid approach to generating higher quality data sets in the creation of improved breast cancer survival prediction models. This approach comprises two main steps: (1) utilization of an outlier filtering approach based on C-Support Vector Classification (C-SVC) to identify and eliminate outlier instances; and (2) application of an over-sampling approach using over-sampling with replacement to increase the number of instances in the minority class. In order to assess the capability and effectiveness of the proposed approach, several measurement methods including basic performance (e.g., accuracy, sensitivity, and specificity), Area Under the receiver operating characteristic Curve (AUC) and F-measure were utilized. Moreover, a 10-fold cross-validation method was used to reduce the bias and variance of the results of breast cancer survivability prediction models. Results have indicated that the proposed approach leads to improving the performance of breast cancer survivability prediction models by up to 28.34% due to the improved training data space.
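Step (2), over-sampling with replacement, can be sketched in a few lines. This is an illustration under assumptions, not the paper's procedure: it grows every smaller class up to the size of the largest class (the abstract only says the minority class is increased, not by how much), and the function name, toy data, and labels are hypothetical.

```python
import random
from collections import Counter

def oversample_minority(instances, labels, seed=0):
    """Duplicate randomly chosen instances (sampling with replacement) from each
    smaller class until every class is as large as the largest one (assumed target)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(instances), list(labels)
    for label, count in counts.items():
        pool = [x for x, y in zip(instances, labels) if y == label]
        extra = rng.choices(pool, k=target - count)  # k may be 0 for the majority class
        out_x.extend(extra)
        out_y.extend([label] * len(extra))
    return out_x, out_y

# Toy data: four "alive" instances and one "dead" instance.
X = [[1], [2], [3], [4], [5]]
y = ["alive", "alive", "alive", "alive", "dead"]
X_bal, y_bal = oversample_minority(X, y)
print(Counter(y_bal))
```

In a cross-validation setting such as the 10-fold scheme mentioned above, the over-sampling would normally be applied to the training folds only, so that duplicated instances do not leak into the held-out fold.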