
Professor Chengqi Zhang

Biography

Chengqi Zhang was appointed by the University of Technology Sydney (UTS) as Executive Director, UTS Data Science, on 15 September 2016, to oversee all research in the Data Science area across UTS. He was appointed as a Research Professor of Information Technology at UTS in December 2001, and since March 2015 he has held an appointment as an Honorary Professor at the University of Queensland (UQ). In April 2008, he accepted an appointment as the founding Director of the UTS Priority Research Centre for Quantum Computation & Intelligent Systems (QCIS), and in March 2013 he was appointed Alternative Dean of the UTS Graduate Research School. He has been the Chairman of the Australian Computer Society National Committee for Artificial Intelligence since November 2005, and the Chairman of the IEEE Computer Society Technical Committee on Intelligent Informatics (TCII) since June 2014.

Prof. Zhang obtained his Bachelor's degree from Fudan University in 1982, his Master's degree from Jilin University in 1985, and his PhD from the University of Queensland in 1991, followed by a Doctor of Science (DSc, a higher doctorate) from Deakin University in 2002, all in Computer Science.

Prof. Zhang's research interests mainly focus on Data Mining and its applications. He has published nearly 300 research papers, including a number in first-class international journals such as Artificial Intelligence and the IEEE and ACM Transactions. He has published seven monographs, edited 16 books, delivered 16 keynote/invited speeches at international conferences, and attracted 12 Australian Research Council grants. For his outstanding research achievements, he was awarded the 2011 NSW Science and Engineering Award in the Engineering, Information and Communications Technology category.

Since Prof. Zhang was appointed Director of QCIS eight years ago, he has led the Centre to publish 702 high-quality papers, triple its national grant funding, and lift the grant ranking of UTS-IT to one of the highest places among all 38 universities in Australia. The Centre attracted nine ARC Future Fellows in the last eight rounds, the highest number among the 38 universities. For these leadership achievements, he was awarded a UTS Vice-Chancellor's Research Excellence Award in the Leadership category.

Prof. Zhang is a Fellow of the Australian Computer Society (ACS) and a Senior Member of the IEEE. He served in the ARC College of Experts from 2012 to 2014, and was the founding Chair of the Steering Committee of the International Conference on Knowledge Science, Engineering, and Management (KSEM) from 2006 to 2014. He has served as an Associate Editor for three international journals, including IEEE Transactions on Knowledge and Data Engineering from 2005 to 2008, and as General Chair, PC Chair, or Organising Chair for five international conferences, including KDD 2015, ICDM 2010 and WI/IAT 2008. He is also the Local Arrangements Chair of IJCAI-2017 (International Joint Conference on Artificial Intelligence) in Melbourne.

Professional

Awards:

  1. NSW Science and Engineering Awards 2011: Award category – Engineering and Information and Communications Technology (the only award in this category in NSW in 2011).
  2. UTS Vice-Chancellor's Awards for Research Excellence 2011: Award category – Research Leadership (co-recipient).

Professional Society:

  1. Member of the ARC College of Experts, Australian Research Council (ARC), from 2012 to 2014.
  2. Fellow of the Australian Computer Society (ACS), No. 3032411, since 2006.
  3. Senior Member of the Institute of Electrical and Electronics Engineers (IEEE), No. 1711670, since 1995.
  4. Member of the Association for the Advancement of Artificial Intelligence (AAAI), No. 37757, since 1995.
  5. Overseas Assessor for the Chinese Academy of Sciences, from Jan. 2005 to Dec. 2008.

Keynote and Invited Speeches

  1. "Online Learning with Trapezoidal Data Streams", in the IFIP 9th International Conference on Intelligent Information Processing (IIP'16), 18-21 November 2016, Melbourne, Australia.
  2. "Current researches on Graph Processing and Mining", in the 3rd International Conference on Data Science (ICDS'16), 22-24 June 2016, Xi'An, China.
  3. "Big Data related Research Issues and Progress", in the 14th International Conference on Web Information System Engineering (WISE'13), 13-15 October 2013, Nanjing, China.
  4. "Cost-Sensitive Classification in Data Mining", in the 6th International Conference on Advanced Data Mining and Applications (ADMA'10), 19-21 November 2010, Chongqing, China.
  5. "Knowledge Discovery from Multiple Databases", in the 2nd International Conference on Emerging Trends in Engineering & Technology (ICETET'09), 16-18 December 2009, Nagpur, India.
  6. "Data Mining for Social Security in E-Government Services" in the 3rd International Conference on Management of e-Commerce and e-Government (ICMeCG'09), 16-19 September 2009, Nanchang, China.
  7. "Developing Actionable Trading Strategies for Trading Agents" in the International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI'09 & IAT'09), 15-18 September 2009, Milan, Italy.
  8. "Agent & Data Mining Interaction: mutual benefits for both communities" in the 3rd International KES Symposium on Agents and Multi-agent Systems – Technologies and Applications, 3-5 June 2009, Uppsala, Sweden.
  9. "Combined Association Rules Mining" in the National Database Conference of China, 23-25 October 2008. Guilin, China.
  10. "Against Expectation Mining for Stock Market Surveillance", in the Symposium on New Technology and Applications in Security Technology, 23-24 August 2007, Chengdu, China.
  11. "Data Mining for Stock Markets" in the International Workshop of Data Mining for Business, 22 May 2007, Nanjing, China.
  12. "Activity Mining to Strengthen Debt Prevention" in the Joint Rough Set Symposium, 14-16 May 2007, Toronto, Canada.
  13. "Missing Data Imputation with Parameter Optimization" in the 1st International Conference on Innovative Computing, Information and Control (ICICIC'06) on 30 August – 1 September 2006 in Beijing, China.
  14. "Domain-Driven Data Mining: Methodologies and Applications" in the fourth International Conference on Active Media Technology (AMT'06) on 7-9 June 2006 in Brisbane, Australia.
  15. "In-Depth Data Mining and Its Applications in Stock Market" in the 1st International Conference on Advanced Data Mining and Applications (ADMA'05) on 22-24 July 2005 in Wuhan, China.
  16. "Agent and Data Mining: Mutual Enhancement by Integration" in the Autonomous Intelligent Systems: Agents and Data Mining (AIS-ADM-05) Workshop on 6-8 June 2005 in St Petersburg, Russia.
  17. Invited speech at the International ICSC Congress on Intelligent Systems and Applications, 11 December 2000, Wollongong, Australia.

Journal Editorial Board:

  1. Associate Editor of "IEEE Transactions on Knowledge and Data Engineering" (IEEE-TKDE), IEEE, from Jan. 2005 to Dec. 2008.
  2. Editor of "Web Intelligence and Agent Systems: An International Journal" (WIAS), IOS Press, since it started in 2003.
  3. Associate Editor of "International Journal of Data Warehousing and Mining" (IJDWM), IGP Press, since it started in 2004.
  4. Associate Editor of "International Journal of Innovative Computing, Information and Control" (IJICIC) since it started in 2004.
  5. Editor of "International Journal of Computational Intelligence" (IJCI) since 2003.
  6. Associate Editor of "Knowledge and Information Systems: An International Journal" (KAIS), Springer-Verlag, from 1998 to Jan. 2006.

Executive Committee:

  1. Member of the Steering Committee of the IEEE International Conference on Agents, since 2016.
  2. Chair of the IEEE Computer Society Technical Committee on Intelligent Informatics (TCII), since June 2014.
  3. Chair of the ACS National Committee for Artificial Intelligence (AI), since 2006.
  4. Chair of the Steering Committee of the International Conference on Knowledge Science, Engineering, and Management (KSEM), from 2006 to 2014.
  5. Chair of the Planning Committee of the Pacific Rim International Workshop on Multi-Agents (PRIMA), from 2000 to 2002.
  6. Member of the Steering Committee of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), since 2004.
  7. Member of the Steering Committee of the Pacific Rim International Conference on Artificial Intelligence (PRICAI), since 2004.
  8. Member of the Steering Committee of the International Conference on Advanced Data Mining and Applications (ADMA), since 2006.

Chairs for Conference Committees:

  1. Local Arrangements Chair for the 26th International Joint Conference on Artificial Intelligence (IJCAI'17), 19-25 August 2017, Melbourne, Australia.
  2. General co-Chair for the 1st International Conference on Crowd Science and Engineering (ICCSE'16), 27-30 July 2016, Vancouver, Canada.
  3. General co-Chair for the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'15), 11-14 August 2015, Sydney, Australia.
  4. General co-Chair for the 8th International Conference on Advanced Data Mining and Applications (ADMA'12), 15-18 December 2012, Nanjing, China.
  5. General co-Chair for the 10th IEEE International Conference on Data Mining (ICDM'10), 13-17 December 2010, Sydney, Australia.
  6. General co-Chair for the 2008 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, (WI'08 and IAT'08), 9-12 December 2008, Sydney, Australia.
  7. General co-Chair for the 2nd International Conference on Advanced Data Mining and Applications (ADMA'06), 14-16 August 2006, Xi'An, China.
  8. General co-Chair for the 1st International Conference on Knowledge Science, Engineering, and Management (KSEM'06), 6-8 August 2006, Guilin, China.
  9. Program Committee co-Chair for the 8th Pacific Rim International Conference on Artificial Intelligence (PRICAI'04), 9-13 August 2004, Auckland, New Zealand.
  10. General co-Chair for the 8th Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD'04), 24-28 May 2004, Sydney, Australia.
  11. Program Committee co-Chair for the 3rd Pacific Rim International Workshop on Multi-Agents (PRIMA'00), 28-29 August 2000, Melbourne, Australia.
  12. Organizing Committee Chair for the 6th Pacific Rim International Conference on Artificial Intelligence (PRICAI'00), Melbourne, Australia, 28 August - 2 September, 2000.
  13. Regional Liaison for Asia/Pacific for the 4th International Conference on MultiAgent Systems (ICMAS'00), Boston, USA, July 7-12, 2000.
  14. Program Committee co-Chair for the 2nd Pacific Rim International Workshop on Multi-Agents (PRIMA'99), Kyoto, Japan, 2 - 3 December, 1999.
  15. Program Committee co-Chair for the 7th Australian Joint Conference on Artificial Intelligence (AI'94), Armidale, Australia (1994).

Guest Editors

  1. Chengqi Zhang, Philip S. Yu, and David Bell, Special Issue on "Domain-Driven Data Mining", IEEE Transactions on Knowledge and Data Engineering (TKDE), June 2010.
  2. Longbing Cao, Zili Zhang, Vladimir Gorodetsky, Chengqi Zhang. Special Issue on "Interaction between agents and data mining", International Journal of Intelligent Information and Database Systems, 1(4): 2007.
  3. Chengqi Zhang, Qiang Yang, and Bing Liu, Special Issue on "Intelligent Data Preparation", IEEE Transactions on Knowledge and Data Engineering (TKDE), 17:9, Sept. 2005.
  4. Shichao Zhang, Chengqi Zhang, and Qiang Yang, Special Issue on "Information Enhancement for Data Mining", IEEE Intelligent Systems, 19:2, March/April, 2004.
  5. Shichao Zhang, Chengqi Zhang, and Qiang Yang, Special Issue on "Data Preparation for Data Mining", Applied Artificial Intelligence: an International Journal (AAI), 17:5-6, May-July 2003.
  6. Chengqi Zhang, Ling Guan, and Zheru Chi, Special Issue on "Learning in Intelligent Algorithms and Systems Design", Journal of Advanced Computational Intelligence, 3:6, Dec. 1999.
Core Member, Joint Research Centre in Intelligent Systems
Director, QCIS - Quantum Computation and Intelligent Systems
Core Member, QCIS - Quantum Computation and Intelligent Systems
Core Member, ACRI - Australia China Relations Institute
Core Member, AAI - Advanced Analytics Institute
BSc (Fudan), MSc (JLU), PhD (UQ), DSc (Deakin)
Fellow, Australian Computer Society
Senior Member, Institute of Electrical and Electronics Engineers
Member, Association for the Advancement of Artificial Intelligence
 
Phone
+61 2 9514 7941

Research Interests

Chengqi Zhang's key areas of research are Data Mining and its applications. He has achieved outstanding research results and provided excellent leadership during his academic career. He has undertaken important theoretical research to discover new knowledge, such as coordination under uncertainty in Multi-Agent Systems, Negative Association Rule mining, Multi-Database mining, and Domain-Driven Data Mining, work recognised in high-quality publications. His research has also been applied in industry, in particular the financial industry and government social security agencies. To date, he has attracted more than $5,000,000 in research grants (mostly awarded by the Australian Research Council – ARC), published nearly 300 refereed papers, edited 16 books, and published seven monographs. Some of his papers have been published in highly prestigious international journals such as Artificial Intelligence, IEEE Transactions, and ACM Transactions. He undertakes individual research, and he also leads his own highly regarded research group. Currently, he is the Executive Director, UTS Data Science, as well as a leader of the Data Mining Program with the Australian Capital Markets Cooperative Research Centre.

Can supervise: Yes

Graduated PhD Students

I have supervised 29 PhD students to completion. Recent graduates are listed below.

Year (awarded) | Student Name | Thesis Title | Degree | Current job and place
2016 | Ting Guo | Efficient Diagnosability Testing and Optimising | PhD | Research Scientist (NICTA)
2016 | Jia Wu | Multi-Graph Learning | PhD | Research Fellow (UTS)
2016 | Shirui Pan | Mining Information Network Streams | PhD | Research Fellow (UTS)
2015 | Xueping Peng | Research of personalised recommendation and information retrieval based on Web logs mining | PhD | Research Fellow (UTS)
2014 | Guodong Long | Instance and Feature based Classification Enhancement for Short & Sparse Texts | PhD | Lecturer (UTS)
2014 | Guohua Liang | Ensemble Predictors: Empirical Studies on Learners Performance and Sample Distributions | PhD |
2013 | Tao Wang | Efficient Techniques for Cost-Sensitive Learning with Multiple Cost Considerations | PhD | Senior System Analyst at Mission Australia
2012 | Yong Yang | Discovering Behaviour Patterns in Security Applications | PhD | Technical Data Manager at Independent Hospital Pricing Authority (IHPA)

Prof. Chengqi Zhang has been a Research Professor of Information Technology at UTS since 2002. Before then, he taught at Deakin University for three years and at the University of New England (UNE) for nine years. He also taught at the Chinese University of Hong Kong for half a year in 2003 whilst on sabbatical leave. His teaching covered a wide range from first-year undergraduate units to postgraduate units, including "Introduction to Business Information Technology" for first-year students; "Data Structures" and "Database" for second-year students; "Electronic Business", "Advanced Database", "Non-procedural Languages", and "Knowledge Engineering" for third-year students; and "Expert Systems", "Advanced Non-procedural Languages", "Distributed Artificial Intelligence", "Logical Foundation of Artificial Intelligence", "Knowledge Management" and "Introduction to Multi-Agent Systems" for Honours and Masters students. He received very positive feedback from his students.

In addition, he mentored junior staff through peer review of their teaching, attending their classes and assessing their performance. He received positive feedback from staff who acknowledged that their teaching had been enhanced by his support.

Books

Chen, Q., Chen, B. & Zhang, C. 2014, Intelligent strategies for pathway mining model and pattern identification, Springer International Publishing, Switzerland.
Cao, L., Yu, P., Zhang, C. & Zhao, Y. 2010, Domain Driven Data Mining, 1, Springer, New York, USA.
* Bridges the gap between business expectations and research output
* Includes techniques, methodologies and case studies in real-life enterprise data mining
* Addresses new areas such as blog mining

In the present thriving global economy, a need has evolved for complex data analysis to enhance an organization's production systems, decision-making tactics, and performance. In turn, data mining has emerged as one of the most active areas in information technologies. Domain Driven Data Mining offers state-of-the-art research and development outcomes on methodologies, techniques, approaches and successful applications in domain-driven, actionable knowledge discovery.
Cao, L., Yu, P.S., Zhang, C. & Zhang, H. 2009, Data mining for business applications.
Data Mining for Business Applications presents state-of-the-art data mining research and development related to methodologies, techniques, approaches and successful applications. The contributions of this book mark a paradigm shift from "data-centered pattern mining" to "domain-driven actionable knowledge discovery (AKD)" for next-generation KDD research and applications. The contents identify how KDD techniques can better contribute to critical domain problems in practice, and strengthen business intelligence in complex enterprise applications. The volume also explores challenges and directions for future data mining research and development in the dialogue between academia and business. Part I centers on developing workable AKD methodologies, including: domain-driven data mining, post-processing rules for actions, domain-driven customer analytics, the role of human intelligence in AKD, maximal pattern-based clusters, and ontology mining. Part II focuses on novel KDD domains and the corresponding techniques, exploring the mining of emergent areas and domains such as: social security data, community security data, gene sequences, mental health information, traditional Chinese medicine data, cancer-related data, blog data, sentiment information, web data, procedures, moving object trajectories, land use mapping, higher education data, flight scheduling, and algorithmic asset management. Researchers, practitioners and university students in the areas of data mining and knowledge discovery, knowledge engineering, human-computer interaction, artificial intelligence, intelligent information processing, decision support systems, knowledge management, and KDD project management are sure to find this a practical and effective means of enhancing their understanding of and using data mining in their own projects. © 2009 Springer Science+Business Media, LLC All rights reserved.
Chen, Q., Zhang, C. & Zhang, S. 2008, Secure Transaction Protocol Analysis, 1, Springer, Berlin, Germany.
Security protocols (cryptographic protocols) have been widely used not only to achieve the traditional goals of data confidentiality, integrity and authentication, but also, more recently, to secure a variety of other desired characteristics of computer-mediated transactions. To guarantee reliable protocols, a great deal of formal work has been undertaken not only to develop diverse tools with specialized or general purpose, but also to apply them to the analysis of realistic protocols. Many of these have proved useful in detecting intuitive attacks on security protocols. In many cases, useful feedback is supplied to designers in order to improve the protocol's security. For both beginners and experienced researchers, this book presents useful information on relevant technologies that can be extended or adapted. A comprehensive introduction to the basic concepts and core techniques is presented. In this chapter, we explain what security protocols are and how they can be used to ensure secure transactions, what the challenging issues in e-commerce (electronic commerce) are, why security protocol analysis is important, how it is performed, and what the ongoing efforts and relevant work are. We will also explain the limitations of previous work and why it is important to develop new approaches. In particular, we focus on the discussion of secure transaction protocols. Finally, some emerging issues and the ways they are being met are also described.
Zhang, Z. & Zhang, C. 2004, Agent Based Hybrid Intelligent Systems, 1, Springer-Verlag, Berlin, Germany.
Zhang, S., Zhang, C. & Wu, X. 2004, Knowledge Discovery in Multiple Databases, Springer-Verlag, Berlin, Germany.
Zhang, C. & Zhang, S. 2002, Association Rules Mining: Models and Algorithms, 1, Springer-Verlag, Germany.

Chapters

Yang, Y., Luo, D. & Zhang, C. 2010, 'A Multiple System Performance Monitoring Model for Web Services' in Cao, L., Bazzan, A.L.C., Gorodetsky, V., Mitkas, P.A., Weiss, G. & Yu, P.S. (eds), Lecture Notes in Artificial Intelligence 5980 - Agents and Data Mining Interaction, Springer, Germany, pp. 149-161.
With the exponential growth of the world wide web, web services have become more and more popular. However, performance monitoring is a key issue in the booming service-oriented architecture regime. Under such loosely coupled and open distributed computing environments, it is necessary to provide a performance monitoring model to estimate the likely performance of a service provider. Although much has been done to develop models and techniques to assess the performance of services (i.e. QoS), most solutions are based on deterministic performance monitoring values or boolean logic. Intuitively, a probabilistic representation could be a more precise and natural way to monitor performance. In this paper, we propose a Bayesian approach to analyze a service provider's behavior to infer the rationale for performance monitoring in the web service environment. This inference helps the user to predict a service provider's performance, based on historical temporal records in a sliding window. Distinctively, it combines evidence from another system (for example, recommendation opinions of third parties) to provide complementary support for decision making. To the best of our knowledge, this is the first approach to produce a final integrated performance prediction from multiple systems in Web services.
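The sliding-window idea in this abstract can be sketched with a toy Beta-Bernoulli model (an illustrative assumption, not the chapter's actual method): recent success/failure records of a provider update a Beta prior, and the posterior mean serves as the performance prediction.

```python
from collections import deque

def predict_success_rate(outcomes, window=10, alpha=1.0, beta=1.0):
    """Toy sketch: posterior mean of a Beta-Bernoulli model over the
    most recent `window` success (1) / failure (0) records."""
    recent = deque(outcomes, maxlen=window)  # keep only the sliding window
    successes = sum(recent)
    # Beta(alpha, beta) prior updated with the windowed evidence
    return (alpha + successes) / (alpha + beta + len(recent))

# A provider that was unreliable long ago but perfect recently:
# old failures fall outside the window and stop penalising the prediction.
history = [0, 0, 0, 0, 0] + [1] * 10
rate = predict_success_rate(history, window=10)  # (1 + 10) / (2 + 10)
```

Windowing is what lets the estimate track pattern variation over time; evidence from a second system (e.g. third-party recommendations) could be folded in as additional pseudo-counts on `alpha` and `beta`.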
Su, G., Ying, M. & Zhang, C. 2010, 'An ADL-Approach to Specifying and Analyzing Centralized-Mode Architectural Connection' in Babar, M.A. & Gorton, I. (eds), Lecture Notes in Computer Science 6285 - Software Architecture, Springer, Germany, pp. 8-23.
A rigorous paradigm coordinating components is important in the design stage of large-scale software engineering. In this paper we propose a new Architecture Description Language, called ACDL, to represent the centralized-mode architectural connection in which all components are linked by a single connector. Following one usual approach to architectural description, in which component types and components are distinguished, and connectors integrate behaviors of components by specifying their coordination protocols, ACDL describes connectors in such a way that connectors are insensitive to the numbers of attached same-type components. Based on ACDL, we develop analytic techniques to facilitate the system checking of temporal properties of an architecture. In particular, our method shows to what extent one can add, delete and replace components without making the whole system lose desired temporal properties, and improves the system checking in several ways, for example enhancing the use of previous checking results to deal with new checking problems.
Zhang, H., Zhao, Y., Cao, L., Zhang, C. & Bohlscheid, H. 2010, 'Rare class association rule mining with multiple imbalanced attributes' in Koh, Y.S. & Rountree, N. (eds), Rare Association Rule Mining and Knowledge Discovery: Technologies for Infrequent and Critical Event, IGI Global, Hershey, Pennsylvania, pp. 66-75.
In this chapter, the authors propose a novel framework for rare class association rule mining. In each class association rule, the right-hand side is a target class while the left-hand side may contain one or more attributes. The algorithm focuses on multiple imbalanced attributes on the left-hand side. In the proposed framework, the rules with and without imbalanced attributes are processed in parallel. The rules without imbalanced attributes are mined through a standard algorithm, while the rules with imbalanced attributes are mined based on newly defined measurements. Through simple transformation, these measurements can be placed in a uniform space so that only a few parameters need to be specified by the user. In the case study, the proposed algorithm is applied in the social security field. Although some attributes are severely imbalanced, rules with a minority of imbalanced attributes have been mined efficiently.
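The shape of a class association rule (attribute set on the left, fixed target class on the right, filtered by support and confidence) can be illustrated with a toy miner; the data, thresholds, and helper name below are invented for the example and are not the chapter's algorithm:

```python
from itertools import combinations

def mine_class_rules(transactions, target_class, min_support=0.2, min_confidence=0.5):
    """Toy class-association-rule miner (illustration only).
    Each transaction is (attribute_set, class_label); rules have the
    form LHS-attributes -> target_class."""
    n = len(transactions)
    items = sorted({i for attrs, _ in transactions for i in attrs})
    rules = []
    for size in (1, 2):  # left-hand sides of one or two attributes
        for lhs in combinations(items, size):
            covered = [c for attrs, c in transactions if set(lhs) <= attrs]
            support = len(covered) / n
            if support < min_support:
                continue
            confidence = sum(c == target_class for c in covered) / len(covered)
            if confidence >= min_confidence:
                rules.append((lhs, target_class, round(support, 2), round(confidence, 2)))
    return rules

# Invented mini-dataset in the spirit of the social security case study
data = [({"young", "male"}, "debt"), ({"young"}, "debt"),
        ({"old", "male"}, "no_debt"), ({"old"}, "no_debt"),
        ({"young", "male"}, "debt")]
rules = mine_class_rules(data, "debt")
```

With imbalanced attributes, plain confidence like this becomes misleading, which is exactly the gap the chapter's specialised measurements are designed to fill.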
Zhao, Y., Cao, L., Zhang, H. & Zhang, C. 2009, 'Data Clustering' in Ferraggine, V.E., Doorn, J.H. & Rivero, L.C. (eds), Handbook of Research on Innovations in Database Technologies and Applications: Current and Future Tr, IGI Global, USA, pp. 562-572.
Clustering is one of the most important techniques in data mining. This chapter presents a survey of popular approaches for data clustering, including well-known clustering techniques, such as partitioning clustering, hierarchical clustering, density-based clustering and grid-based clustering, and recent advances in clustering, such as subspace clustering, text clustering and data stream clustering. The major challenges and future trends of data clustering will also be introduced in this chapter. The remainder of this chapter is organized as follows. The background of data clustering will be introduced in Section 2, including the definition of clustering, categories of clustering techniques, features of good clustering algorithms, and the validation of clustering. Section 3 will present main approaches for clustering, which range from the classic partitioning and hierarchical clustering to recent approaches of bi-clustering and semisupervised clustering. Challenges and future trends will be discussed in Section 4, followed by the conclusions in the last section.
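Partitioning clustering, the first family surveyed in this chapter, can be sketched with a minimal k-means implementation (an illustrative sketch only; the point data and parameters are invented):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means (partitioning clustering) on 2-D points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from random data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                            + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = (sum(x for x, _ in c) / len(c),
                                sum(y for _, y in c) / len(c))
    return centroids, clusters

# Two well-separated groups of three points each
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
```

Hierarchical, density-based, and grid-based methods trade this fixed-k partitioning for other cluster notions, which is the comparison the survey develops.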
Zhao, Y., Zhang, H., Cao, L., Bohlscheid, H., Ou, Y. & Zhang, C. 2009, 'Data Mining Applications in Social Security' in Cao, L., Yu, P.S., Zhang, C. & Zhang, H. (eds), Data Mining for Business Applications, Springer, New York, USA, pp. 81-96.
This chapter presents four applications of data mining in social security. The first is an application of decision tree and association rules to find the demographic patterns of customers. Sequence mining is used in the second application to find activity sequence patterns related to debt occurrence. In the third application, combined association rules are mined from heterogeneous data sources to discover patterns of slow payers and quick payers. In the last application, clustering and analysis of variance are employed to check the effectiveness of a new policy.
Cao, L., Yu, P., Zhang, C. & Zhang, H. 2009, 'Introduction to Domain Driven Data Mining' in Cao, L., Yu, P.S., Zhang, C. & Zhang, H. (eds), Data Mining for Business Applications, Springer, New York, USA, pp. 3-10.
Data Mining for Business Applications presents state-of-the-art data mining research and development related to methodologies, techniques, approaches and successful applications. The contributions of this book mark a paradigm shift from "data-centered pattern mining" to "domain-driven actionable knowledge discovery (AKD)" for next-generation KDD research and applications. The contents identify how KDD techniques can better contribute to critical domain problems in practice, and strengthen business intelligence in complex enterprise applications. The volume also explores challenges and directions for future data mining research and development in the dialogue between academia and business.
Wu, S., Zhao, Y., Zhang, H., Zhang, C., Cao, L. & Bohlscheid, H. 2009, 'Debt Detection in Social Security by Adaptive Sequence Classification' in Karagiannis, D. & Jin, Z. (eds), Knowledge Science, Engineering and Management, Springer, Germany, pp. 192-203.
Debt detection is important for improving payment accuracy in social security. Since debt detection from customer transaction data can be generally modelled as a fraud detection problem, a straightforward solution is to extract features from transaction sequences and build a sequence classifier for debts. For long-running debt detections, the patterns in the transaction sequences may exhibit variation from time to time, which makes it imperative to adapt classification to the pattern variation. In this paper, we present a novel adaptive sequence classification framework for debt detection in a social security application. The central technique is to catch up with the pattern variation by boosting discriminative patterns and depressing less discriminative ones according to the latest sequence data.
Moemeng, C., Gorodetsky, V., Zuo, Z., Yang, Y. & Zhang, C. 2009, 'Agent-Based Distributed Data Mining: A Survey' in Longbing Cao (ed), Data Mining and Multi-agent Integration, Springer, New York, USA, pp. 1-12.
Distributed data mining originated from the need to mine over decentralised data sources. Data mining techniques operating in such a complex environment face great dynamics, since changes in the system can affect its overall performance. Agent computing, whose aim is to deal with complex systems, has revealed opportunities to improve distributed data mining systems in a number of ways. This paper surveys the integration of multi-agent systems and distributed data mining, also known as agent-based distributed data mining, in terms of significance, system overview, existing systems, and research trends.
Cao, L., Yu, P.S., Zhang, C. & Zhang, H. 2009, 'Preface' in Cao, L., Yu, P.S., Zhang, C. & Zhang, H. (eds), Data Mining for Business Applications, Springer, pp. v-vi.
Cao, L. & Zhang, C. 2008, 'Domain Driven Data Mining' in Taniar, D. (ed), Data Mining and Knowledge Discovery Technologies, IGI Global, USA, pp. 196-223.
Quantitative-intelligence-based traditional data mining is facing grand challenges from real-world enterprise and cross-organization applications. For instance, the usual demonstration of specific algorithms cannot support business users in taking actions to their advantage and needs. We think this is due to the data-driven philosophy focused on quantitative intelligence: it either views data mining as an autonomous data-driven, trial-and-error process, or only analyzes business issues in an isolated, case-by-case manner. Based on experience and lessons learnt from real-world data mining and complex systems, this article proposes a practical data mining methodology referred to as Domain-Driven Data Mining. On top of quantitative intelligence and hidden knowledge in data, domain-driven data mining aims to meta-synthesize quantitative intelligence and qualitative intelligence in mining complex applications in which humans are in the loop. It targets actionable knowledge discovery in a constrained environment for satisfying user preferences. The domain-driven methodology consists of key components including understanding the constrained environment, business-technical questionnaires, representing and involving domain knowledge, human-mining cooperation and interaction, constructing next-generation mining infrastructure, in-depth pattern mining and post-processing, business interestingness and actionability enhancement, and loop-closed human-cooperated iterative refinement. Domain-driven data mining complements the data-driven methodology; the metasynthesis of qualitative intelligence and quantitative intelligence has the potential to discover knowledge from complex systems and enhance knowledge actionability for practical use by industry and business.
Cao, L., Zhang, C., Luo, D. & Dai, R. 2007, 'Intelligence Metasynthesis in Building Business Intelligence Systems' in Carbonell, J.G., Siekmann, J., Zhong, N., Liu, J., Yao, Y., Wu, J., Lu, S. & Li, K. (eds), Lecture Notes in Artificial Intelligence - Lecture Notes in Computer Science (Book Series), Springer, Germany, pp. 454-470.
View/Download from: UTS OPUS or Publisher's site
In our previous work, we analyzed the shortcomings of existing business intelligence (BI) theory and its actionable capability. One of the works we presented is the ontology-based integration of business, data warehousing and data mining, which may make existing BI systems as user- and business-friendly as expected. However, it is challenging to tackle these issues and construct actionable, business-friendly systems by simply improving the existing BI framework. Therefore, in this paper, we propose a new framework for constructing next-generation BI systems: intelligence metasynthesis. Next-generation BI systems should, to some extent, synthesize four types of intelligence: data intelligence, domain intelligence, human intelligence and network/web intelligence. The theory guiding the intelligence metasynthesis is metasynthetic engineering, for which an appropriate intelligence integration framework is substantially important. We first address the roles of each type of intelligence in developing next-generation BI systems. Further, implementation issues are addressed by discussing key components for synthesizing the intelligence. The proposed framework is based on our real-world experience and practice in designing and implementing BI systems. It also greatly benefits from multi-disciplinary knowledge dialogue, such as complex intelligent systems and cognitive sciences. The proposed theoretical framework has the potential to deal with key challenges in existing BI frameworks and systems.
Zhang, Z. & Zhang, C. 2004, 'Constructing hybrid intelligent systems for data mining from agent perspectives' in Zhong, N. & Liu, J. (eds), Intelligent Technologies for Information Analysis, Springer, Berlin, Germany, pp. 333-359.
Zhang, C. & Zhang, Z. 2003, 'An Agent-based Soft Computing Society with Applications in Financial Investment Planning' in Yu, X. & Kacprzyk, J. (eds), Applied Decision Support with Soft Computing, Springer, Germany, pp. 99-126.

Conferences

Wu, W., Li, B., Chen, L. & Zhang, C. 2016, 'Cross-View Feature Hashing for Image Retrieval', Advances in Knowledge Discovery and Data Mining, Pacific Asia Knowledge Discovery and Data Mining Conference (PAKDD) 2016, Springer International Publishing, The University of Auckland, Auckland, New Zealand, pp. 203-214.
View/Download from: Publisher's site
Traditional cross-view information retrieval mainly rests on correlating two sets of features in different views. However, features in different views usually have different physical interpretations. It may be inappropriate to map multiple views of data onto a shared feature space and directly compare them. In this paper, we propose a simple yet effective Cross-View Feature Hashing (CVFH) algorithm via a 'partition and match' approach. The feature space for each view is bi-partitioned multiple times using B hash functions, and the resulting binary codes for all the views can thus be represented in a compatible B-bit Hamming space. To ensure that the hashed feature space is effective for supporting generic machine learning and information retrieval functionalities, the hash functions are learned to satisfy two criteria: (1) the neighbors in the original feature spaces should also be close in the Hamming space; and (2) the binary codes for multiple views of the same sample should be similar in the shared Hamming space. We apply CVFH to cross-view image retrieval. The experimental results show that CVFH can outperform the Canonical Correlation Analysis (CCA) based cross-view method.
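The bi-partitioning idea above can be illustrated with a minimal sketch. This is hypothetical code: it uses unlearned random hyperplanes in place of CVFH's learned hash functions, only to show how two views of different dimensionality end up comparable in one B-bit Hamming space.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 16  # number of hash bits, i.e. bi-partitions per view

def make_hash(dim):
    """One random-hyperplane hash per bit: the sign of a projection."""
    planes = rng.standard_normal((B, dim))
    return lambda x: (planes @ x > 0).astype(np.uint8)

# Two views with different dimensionalities (e.g. colour vs. texture features).
hash_view_a = make_hash(dim=64)
hash_view_b = make_hash(dim=128)

xa = rng.standard_normal(64)   # a sample described in view A
xb = rng.standard_normal(128)  # the same sample described in view B

code_a, code_b = hash_view_a(xa), hash_view_b(xb)

# Both codes now live in the same B-bit Hamming space and are comparable.
hamming = int(np.sum(code_a != code_b))
```

In CVFH the hyperplanes would instead be learned so that neighbors stay close and the two views of one sample hash to similar codes.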
Zhang, Q., Zhang, Q., Long, G., Zhang, P. & Zhang, C. 2016, 'Exploring heterogeneous product networks for discovering collective marketing hyping behavior', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 40-51.
View/Download from: UTS OPUS or Publisher's site
© Springer International Publishing Switzerland 2016. Online spam comments often misguide users during online shopping. Existing online spam detection methods rely on semantic clues, behavioral footprints, and relational connections between users in review systems. Although these methods can successfully identify spam activities, evolving fraud strategies can escape the detection rules by purchasing positive comments from massive numbers of random users, i.e., a user Cloud. In this paper, we study a new problem, Collective Marketing Hyping detection, for detecting spam comments generated from the user Cloud. It is defined as detecting a group of marketing-hyping products with untrustworthy marketing promotion behavior. We propose a new learning model that uses heterogeneous product networks extracted from product review systems. Our model aims to mine a group of hyping activities, which differs from existing models that only detect a single product with hyping activities. We show the existence of Collective Marketing Hyping behavior in real-life networks. Experimental results demonstrate that the product information network can effectively detect intentional fraudulent product promotions.
Chen, Q., Lan, C., Li, J., Chen, B., Wang, L. & Zhang, C. 2016, 'Depth-first search encoding of RNA substructures', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 328-334.
View/Download from: Publisher's site
© Springer International Publishing Switzerland 2016. RNA structural motifs are important in the RNA folding process. Traditional index-based and shape-based schemas are useful in modeling RNA secondary structures but ignore the structural discrepancies of individual RNA family members. Further, in-depth analysis of the underlying substructure patterns is underdeveloped owing to varied and unnormalized substructures. This prevents us from understanding RNA functions. This article proposes a DFS (depth-first search) encoding for RNA substructures. The results show that our methods are useful in modeling complex RNA secondary structures.
Liu, B., Chen, L., Liu, C., Zhang, C. & Qiu, W. 2016, 'Mining co-locations from continuously distributed uncertain spatial data', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 66-78.
View/Download from: Publisher's site
© Springer International Publishing Switzerland 2016. A co-location pattern is a group of spatial features whose instances tend to locate together in geographic space. While traditional co-location mining focuses on discovering co-location patterns from deterministic spatial data sets, in this paper, we study the problem in the context of continuously distributed uncertain data. In particular, we aim to discover co-location patterns from uncertain spatial data where locations of spatial instances are represented as multivariate Gaussian distributions. We first formulate the problem of probabilistic co-location mining based on newly defined prevalence measures. When the locations of instances are represented as continuous variables, the major challenges of probabilistic co-location mining lie in the efficient computation of prevalence measures and the verification of the probabilistic neighborhood relationship between instances. We develop an effective probabilistic co-location mining framework integrated with optimization strategies to address the challenges. Our experiments on multiple datasets demonstrate the effectiveness of the proposed algorithm.
Zhang, Q., Zhang, P., Long, G., Ding, W., Zhang, C. & Wu, X. 2016, 'Towards mining trapezoidal data streams', Proceedings - IEEE International Conference on Data Mining, ICDM, pp. 1111-1116.
View/Download from: UTS OPUS or Publisher's site
© 2015 IEEE. We study a new problem of learning from doubly-streaming data where both data volume and feature space increase over time. We refer to the problem as mining trapezoidal data streams. The problem is challenging because both data volume and feature space are increasing, to which existing online learning, online feature selection and streaming feature selection algorithms are inapplicable. We propose a new Sparse Trapezoidal Streaming Data mining algorithm (STSD) and its two variants, which combine online learning and online feature selection to enable learning trapezoidal data streams with infinite training instances and features. Specifically, when new training instances carrying new features arrive, the classifier updates the existing features by following the passive-aggressive update rule used in online learning and updates the new features with the structural risk minimization principle. Feature sparsity is also introduced using projected truncation techniques. Extensive experiments on UCI data sets demonstrate the performance of the proposed algorithms.
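The "growing feature space" setting above can be sketched in a few lines. This is a hypothetical simplification of STSD: it keeps only the passive-aggressive part (padding the weight vector when new features arrive), omitting the structural-risk update for new features and the sparsity truncation.

```python
import numpy as np

class TrapezoidalPA:
    """Toy passive-aggressive learner whose feature space may grow
    with each incoming instance (illustrative, not the STSD algorithm)."""

    def __init__(self):
        self.w = np.zeros(0)  # weight vector grows as features appear

    def fit_one(self, x, y):            # y in {-1, +1}
        if x.size > self.w.size:        # new features arrived: pad weights
            self.w = np.pad(self.w, (0, x.size - self.w.size))
        loss = max(0.0, 1.0 - y * float(self.w @ x))   # hinge loss
        if loss > 0:                    # classic PA-style aggressive step
            tau = loss / float(x @ x)
            self.w = self.w + tau * y * x

model = TrapezoidalPA()
model.fit_one(np.array([1.0, -1.0]), +1)       # instance with 2 features
model.fit_one(np.array([1.0, -1.0, 0.5]), -1)  # 3 features: space grew
```

After the second instance the weight vector has silently expanded to cover the new feature, which is the core difficulty of trapezoidal streams.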
Yan, Y., Tan, M., Yang, Y., Tsang, I. & Zhang, C. 2015, 'Scalable maximum margin matrix factorization by active riemannian subspace search', Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), IJCAI International Joint Conference on Artificial Intelligence, AAAI, Buenos Aires, Argentina, pp. 3988-3994.
View/Download from: UTS OPUS
The user ratings in recommendation systems are usually in the form of ordinal discrete values. To give more accurate prediction of such rating data, maximum margin matrix factorization (M3F) was proposed. Existing M3F algorithms, however, either have massive computational cost or require expensive model selection procedures to determine the number of latent factors (i.e. the rank of the matrix to be recovered), making them less practical for large scale data sets. To address these two challenges, in this paper, we formulate M3F with a known number of latent factors as the Riemannian optimization problem on a fixed-rank matrix manifold and present a block-wise nonlinear Riemannian conjugate gradient method to solve it efficiently. We then apply a simple and efficient active subspace search scheme to automatically detect the number of latent factors. Empirical studies on both synthetic data sets and large real-world data sets demonstrate the superior efficiency and effectiveness of the proposed method.
Liu, B., Chen, L., Liu, C., Zhang, C. & Qiu, W. 2015, 'RCP Mining: Towards the Summarization of Spatial Co-location Patterns', Advances in Spatial and Temporal Databases (LNCS), 14th International Symposium on Advances in Spatial and Temporal Databases, Springer, Hong Kong, China, pp. 451-469.
View/Download from: UTS OPUS or Publisher's site
Co-location pattern mining is an important task in spatial data mining. However, the traditional framework of co-location pattern mining produces an exponential number of patterns because of the downward closure property, which makes it hard for users to understand, or apply. To address this issue, in this paper, we study the problem of mining representative co-location patterns (RCP). We first define a covering relationship between two co-location patterns by finding a new measure to appropriately quantify the distance between patterns in terms of their prevalence, based on which the problem of RCP mining is formally formulated. To solve the problem of RCP mining, we first propose an algorithm called RCPFast, adopting the post-mining framework that is commonly used by existing distance-based pattern summarization techniques. To address the peculiar challenge in spatial data mining, we further propose another algorithm, RCPMS, which employs the mine-and-summarize framework that pushes pattern summarization into the co-location mining process. Optimization strategies are also designed to further improve the performance of RCPMS. Our experimental results on both synthetic and real-world data sets demonstrate that RCP mining effectively summarizes spatial co-location patterns, and RCPMS is more efficient than RCPFast, especially on dense data sets.
Wang, H., Zhang, P., Chen, L., Liu, H. & Zhang, C. 2015, 'Online Diffusion Source Detection in Social Networks', Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), International Joint Conference on Neural Networks (IJCNN), IEEE, Killarney, Ireland, pp. 1-8.
View/Download from: Publisher's site
In this paper we study a new problem of online diffusion source detection in social networks. Existing work on diffusion source detection focuses on offline learning, which assumes data collected from network detectors are static and a snapshot of the network is available before learning. However, an offline learning model does not meet the needs of early warning, real-time awareness, and real-time response to malicious information spreading in social networks. In this paper, we combine online learning and regression-based detection methods for real-time diffusion source detection. Specifically, we propose a new ℓ1 non-convex regression model as the learning function, and an Online Stochastic Sub-gradient algorithm (OSS for short). The proposed model is empirically evaluated on both synthetic and real-world networks. Experimental results demonstrate the effectiveness of the proposed model.
Song, K., Feng, S., Gao, W., Wang, D., Chen, L. & Zhang, C. 2015, 'Building emotion lexicon from microblogs by combining effects of seed words and emoticons in a heterogeneous graph', Proceedings of the 26th ACM Conference on Hypertext & Social Media, The 26th ACM Conference on HyperText and Social Media (HT'15), ACM, Guzelyurt, Northern Cyprus, pp. 283-292.
View/Download from: Publisher's site
As an indispensable resource for emotion analysis, emotion lexicons have attracted increasing attention in recent years. Most existing methods focus on capturing the single emotional effect of words rather than the emotion distributions which are helpful to model multiple complex emotions in a subjective text. Meanwhile, automatic lexicon building methods are overly dependent on seed words but neglect the effect of emoticons, which are natural graphical labels of fine-grained emotions. In this paper, we propose a novel emotion lexicon building framework that leverages both seed words and emoticons simultaneously to capture emotion distributions of candidate words more accurately. Our method overcomes the weakness of existing methods by combining the effects of both seed words and emoticons in a unified three-layer heterogeneous graph, in which a multi-label random walk (MLRW) algorithm is performed to strengthen the emotion distribution estimation. Experimental results on real-world data reveal that our constructed emotion lexicon achieves promising results for emotion classification compared to the state-of-the-art lexicons.
Wang, H., Zhang, P., Tsang, I., Chen, L. & Zhang, C. 2015, 'Defragging Subgraph Features for Graph Classification', Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, The 24th ACM International on Conference on Information and Knowledge Management (CIKM'15), ACM, Melbourne, VIC, Australia, pp. 1687-1690.
View/Download from: UTS OPUS or Publisher's site
Graph classification is an important tool for analysing structured and semi-structured data, where subgraphs are commonly used as the feature representation. However, the number and size of subgraph features crucially depend on the threshold parameters of frequent subgraph mining algorithms. Any improper setting of the parameters will generate many trivial short-pattern subgraph fragments which dominate the feature space, distort graph classifiers and bury interesting long-pattern subgraphs. In this paper, we propose a new Subgraph Join Feature Selection (SJFS) algorithm. The SJFS algorithm, by forcing graph classifiers to join short-pattern subgraph fragments, can defrag trivial subgraph features and deliver long-pattern interesting subgraphs. Experimental results on both synthetic and real-world social network graph data demonstrate the performance of the proposed method.
Qin, L., Li, R.H., Chang, L. & Zhang, C. 2015, 'Locally densest subgraph discovery', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Sydney, Australia, pp. 965-974.
View/Download from: UTS OPUS or Publisher's site
© 2015 ACM. Mining dense subgraphs from a large graph is a fundamental graph mining task and can be widely applied in a variety of application domains such as network science, biology, graph databases, web mining, graph compression, and micro-blogging systems. Here a dense subgraph is defined as a subgraph with high density (#edges / #nodes). Existing studies of this problem either focus on finding the densest subgraph or identifying an optimal clique-like dense subgraph, and they adopt a simple greedy approach to find the top-k dense subgraphs. However, their identified subgraphs cannot be used to represent the dense regions of the graph. Intuitively, to represent a dense region, the subgraph identified should be the subgraph with the highest density in its local region in the graph. However, it is non-trivial to formally model a locally densest subgraph. In this paper, we aim to discover top-k such representative locally densest subgraphs of a graph. We provide an elegant parameter-free definition of a locally densest subgraph. The definition not only fits well with the intuition, but is also associated with several nice structural properties. We show that the set of locally densest subgraphs in a graph can be computed in polynomial time. We further propose three novel pruning strategies to largely reduce the search space of the algorithm. In our experiments, we use several real datasets with various graph properties to evaluate the effectiveness of our model using four quality measures and a case study. We also test our algorithms on several real web-scale graphs, one of which contains 118.14 million nodes and 1.02 billion edges, to demonstrate the high efficiency of the proposed algorithms.
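For context, the "simple greedy approach" the abstract contrasts with can be sketched as Charikar-style peeling: repeatedly delete a minimum-degree node and keep the intermediate node set with the best density |E|/|V|. This is the baseline idea only, not the paper's locally-densest-subgraph algorithm.

```python
from collections import defaultdict

def greedy_densest(edges):
    """Peel min-degree nodes; return the densest intermediate subgraph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    m = len(edges)
    best_set, best_density = set(nodes), m / len(nodes)
    while len(nodes) > 1:
        u = min(nodes, key=lambda n: len(adj[n]))  # minimum-degree node
        m -= len(adj[u])                           # drop its incident edges
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
        nodes.discard(u)
        density = m / len(nodes)
        if density > best_density:
            best_density, best_set = density, set(nodes)
    return best_set, best_density

# A 4-clique plus a pendant node: peeling isolates the clique (density 1.5).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
sub, d = greedy_densest(edges)
```

The paper's point is that a single globally densest subgraph like this cannot represent all dense regions, motivating the locally densest definition.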
Wang, H., Zhang, P., Chen, L. & Zhang, C. 2015, 'Socialanalysis: A Real-Time query and mining system from social media data streams', Databases Theory and Applications (LNCS), Australasian Database Conference, Springer, Melbourne, Australia, pp. 318-322.
View/Download from: Publisher's site
© Springer International Publishing Switzerland 2015. In this paper, we present our recent progress in designing a real-time system, SocialAnalysis, to discover and summarize emergent social events from social media data streams. In the social networks era, people frequently post messages or comments about their activities and opinions. Hence, there exist temporal correlations between the physical world and virtual social networks, which can help us monitor and track social events, detecting and localizing anomalous events before they break out, so as to provide early warning. The key technologies in the system include: (1) data denoising methods based on multiple features, which screen out the query-related event data from massive background data; (2) abnormal event detection methods based on statistical learning, which can detect anomalies by analyzing and mining a series of observations and statistics on the time axis; (3) geographical position recognition, which is used to recognize regions where abnormal events may happen.
Wu, J., Pan, S., Zhu, X., Cai, Z. & Zhang, C. 2015, 'Multi-graph-view learning for complicated object classification', IJCAI'15 Proceedings of the 24th International Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, Buenos Aires, pp. 3953-3959.
Wu, J., Hong, Z., Pan, S., Zhu, X., Zhang, C. & Cai, Z. 2014, 'Multi-Graph Learning with Positive and Unlabeled Bags', 2014 SIAM International Conference on Data Mining, SIAM, Philadelphia, Pennsylvania, USA, pp. 217-225.
View/Download from: UTS OPUS or Publisher's site
In this paper, we formulate a new multi-graph learning task with only positive and unlabeled bags, where labels are only available for bags but not for individual graphs inside the bag. This problem setting raises significant challenges because the bag-of-graph setting has no features to directly represent graph data, and no negative bags exist for deriving discriminative classification models. To address this challenge, we propose a puMGL learning framework which relies on two iteratively combined processes for multi-graph learning: (1) deriving features to represent graphs for learning; and (2) deriving discriminative models with only positive and unlabeled graph bags. For the former, we derive a subgraph scoring criterion to select a set of informative subgraphs to convert each graph into a feature space. To handle unlabeled bags, we assign a weight value to each bag and use the adjusted weight values to select the most promising unlabeled bags as negative bags. A margin graph pool (MGP), which contains some representative graphs from positive bags and identified negative bags, is used for selecting subgraphs and training graph classifiers. The iterative subgraph scoring, bag weight updating, and MGP-based graph classification forms a closed loop to find optimal subgraphs and the most suitable unlabeled bags for multi-graph learning. Experiments and comparisons on real-world multi-graph data demonstrate the performance of the algorithm.
Wu, J., Zhu, X., Zhang, C. & Cai, Z. 2014, 'Multi-Instance Learning from Positive and Unlabeled Bags', The 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining, The 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer International Publishing, Taiwan, pp. 237-248.
View/Download from: UTS OPUS
Wu, J., Pan, S., Cai, Z., Zhu, X. & Zhang, C. 2014, 'Dual instance and attribute weighting for Naive Bayes classification', Proceedings of the International Joint Conference on Neural Networks, pp. 1675-1679.
View/Download from: UTS OPUS or Publisher's site
© 2014 IEEE. Naive Bayes (NB) is a popular classification technique for data mining and machine learning. Many methods exist to improve the performance of NB by overcoming its primary weakness, the assumption that attributes are conditionally independent given the class, using techniques such as backwards sequential elimination and lazy elimination. Some weighting technologies, including attribute weighting and instance weighting, have also been proposed to improve the accuracy of NB. In this paper, we propose a dual weighted model, namely DWNB, for NB classification. In DWNB, we first employ an instance-similarity-based method to weight each training instance. After that, we build an attribute weighted model based on the new training data, where the calculation of the probability value is based on the embedded instance weights. The dual instance and attribute weighting allows DWNB to tackle the conditional independence assumption for accurate classification. Experiments and comparisons on 36 benchmark data sets demonstrate that DWNB outperforms existing weighted NB algorithms.
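The way instance weights enter NB can be shown with a minimal sketch: every count in the class priors and conditional probabilities becomes a weighted count. This illustrates the general mechanism only; DWNB's actual similarity-based weight computation and attribute weighting are not reproduced here.

```python
import math
from collections import defaultdict

def weighted_nb_train(X, y, w):
    """Accumulate weighted counts instead of plain counts."""
    class_w = defaultdict(float)
    cond_w = defaultdict(float)   # (class, attr_index, value) -> weight
    for xi, yi, wi in zip(X, y, w):
        class_w[yi] += wi
        for j, v in enumerate(xi):
            cond_w[(yi, j, v)] += wi
    return class_w, cond_w

def weighted_nb_predict(x, class_w, cond_w, n_values=2):
    total = sum(class_w.values())
    best, best_lp = None, -math.inf
    for c, cw in class_w.items():
        lp = math.log(cw / total)
        for j, v in enumerate(x):
            # Laplace smoothing over n_values possible attribute values
            lp += math.log((cond_w[(c, j, v)] + 1) / (cw + n_values))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

X = [(0, 1), (0, 0), (1, 1), (1, 0)]
y = ["a", "a", "b", "b"]
w = [1.0, 1.0, 1.0, 1.0]   # uniform weights reduce to plain NB
cw, condw = weighted_nb_train(X, y, w)
```

With non-uniform weights, instances judged more similar to their neighbours would simply contribute more to each count.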
Wu, J., Cai, Z., Pan, S., Zhu, X. & Zhang, C. 2014, 'Attribute weighting: How and when does it work for Bayesian Network Classification', Proceedings of the International Joint Conference on Neural Networks, pp. 4076-4083.
View/Download from: UTS OPUS or Publisher's site
© 2014 IEEE. A Bayesian Network (BN) is a graphical model which can be used to represent conditional dependencies between random variables, such as diseases and symptoms. A Bayesian Network Classifier (BNC) uses a BN to characterize the relationships between attributes and the class labels, where a simplified approach is to employ a conditional independence assumption between attributes and the corresponding class labels, i.e., the Naive Bayes (NB) classification model. One major approach to mitigate NB's primary weakness (the conditional independence assumption) is attribute weighting, and this type of approach has been proved to be effective for NB with simple structure. However, for weighted BNCs involving complex structures, in which attribute weighting is embedded into the model, there is no existing study on whether the weighting will work for complex BNCs and how effectively it will impact the learning of a given task. In this paper, we first survey several complex structure models for BNCs, and then carry out experimental studies to investigate the effectiveness of the attribute weighting strategies for complex BNCs, with a focus on Hidden Naive Bayes (HNB) and Averaged One-Dependence Estimation (AODE). Our studies use classification accuracy (ACC), area under the ROC curve ranking (AUC), and conditional log likelihood (CLL) as the performance metrics. Experiments and comparisons on 36 benchmark data sets demonstrate that attribute weighting technologies only slightly outperform unweighted complex BNCs with respect to ACC and AUC, but significant improvement can be observed using CLL.
Zhang, Y., Zhang, W., Lin, X., Cheema, M.A. & Zhang, C. 2014, 'Matching dominance: capture the semantics of dominance for multi-dimensional uncertain objects', Conference on Scientific and Statistical Database Management, SSDBM '14, Aalborg, Denmark, June 30 - July 02, 2014, pp. 18-18.
View/Download from: UTS OPUS or Publisher's site
Wu, J., Hong, Z., Pan, S., Zhu, X., Cai, Z. & Zhang, C. 2014, 'Exploring Features for Complicated Objects: Cross-View Feature Selection for Multi-Instance Learning', Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, ACM, pp. 1699-1708.
View/Download from: UTS OPUS or Publisher's site
Wu, J., Hong, Z., Pan, S., Zhu, X., Cai, Z. & Zhang, C. 2014, 'Multi-graph-view Learning for Graph Classification', Proceedings of the 2014 IEEE International Conference on Data Mining (ICDM), IEEE International Conference on Data Mining, IEEE, Shenzhen, China, pp. 590-599.
View/Download from: UTS OPUS or Publisher's site
Graph classification has traditionally focused on graphs generated from a single feature view. In many applications, it is common to have useful information from different channels/views to describe objects, which naturally results in a new representation with multiple graphs generated from different feature views being used to describe one object. In this paper, we formulate a new Multi-Graph-View learning task for graph classification, where each object to be classified contains graphs from multiple graph-views. This problem setting is essentially different from traditional single-graph-view graph classification, where graphs are from one single feature view. To solve the problem, we propose a Cross Graph-View Subgraph Feature based Learning (gCGVFL) algorithm that explores an optimal set of subgraphs, across multiple graph-views, as features to represent graphs. Specifically, we derive an evaluation criterion to estimate the discriminative power and the redundancy of subgraph features across all views, and assign proper weight values to each view to indicate its importance for graph classification. The iterative cross graph-view subgraph scoring and graph-view weight updating form a closed loop to find optimal subgraphs to represent graphs for multi-graph-view learning. Experiments and comparisons on real-world tasks demonstrate the algorithm's performance.
Qin, L., Yu, J.X., Chang, L., Cheng, H., Zhang, C. & Lin, X. 2014, 'Scalable big graph processing in MapReduce', International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pp. 827-838.
View/Download from: UTS OPUS or Publisher's site
Zhang, C., Guo, T., Zhu, X.Q. & Pei, J. 2014, 'SNOC: Streaming Network Node Classification', Proceedings of the 2014 IEEE International Conference on Data Mining (ICDM), 2014 IEEE International Conference on Data Mining (ICDM), pp. 150-159.
View/Download from: Publisher's site
Many real-world networks are featured with dynamic changes, such as new nodes and edges, and modification of the node content. Because changes are continuously introduced to the network in a streaming fashion, we refer to such dynamic networks as streaming networks. In this paper, we propose a new classification method for streaming networks, namely streaming network node classification (SNOC). For streaming networks, the essential challenge is to properly capture the dynamic changes of the node content and node interactions to support node classification. While streaming networks are dynamically evolving, for a short temporal period, a subset of salient features are essentially tied to the network content and structures, and therefore can be used to characterize the network for classification. To achieve this goal, we propose to carry out streaming network feature selection (SNF) from the network, and use the selected features as a gauge to classify unlabeled nodes. A Laplacian based quality criterion is proposed to guide the node classification, where the Laplacian matrix is generated based on node labels and structures. Node classification is achieved by finding the class that results in the minimal gauging value with respect to the selected features. By frequently updating the features selected from the network, node classification can quickly adapt to the changes in the network for maximal performance gain. Experiments demonstrate that SNOC is able to capture changes in network structures and node content, and outperforms baseline approaches with significant performance gain.
Pan, S., Zhu, X., Zhang, C. & Yu, P. 2013, 'Graph Stream Classification using Labeled and Unlabeled Graphs', Proceedings of the 29th IEEE International Conference on Data Engineering, IEEE International Conference on Data Engineering, IEEE, Brisbane, Australia, pp. 398-409.
View/Download from: UTS OPUS or Publisher's site
Wan, L., Chen, L. & Zhang, C. 2013, 'Mining Frequent Serial Episodes over Uncertain Sequence Data', The 16th International Conference on Extending Database Technology (EDBT 2013), International Conference on Extending Database Technology, ACM EDBT/ICDT 2013, Genoa, Italy, pp. 215-226.
View/Download from: UTS OPUS or Publisher's site
Data uncertainty has posed many unique challenges to nearly all types of data mining tasks, creating a need for uncertain data mining. In this paper, we focus on the particular task of mining probabilistic frequent serial episodes (P-FSEs) from uncertain sequence data, which applies to many real applications, including sensor readings and customer purchase sequences. We first define the notion of P-FSEs, based on the frequentness probabilities of serial episodes under possible world semantics. To discover P-FSEs over an uncertain sequence, we propose: 1) an exact approach that computes the accurate frequentness probabilities of episodes; 2) an approximate approach that approximates the frequency of episodes using probability models; 3) an optimized approach that efficiently prunes a candidate episode by estimating an upper bound of its frequentness probability using approximation techniques. We conduct extensive experiments to evaluate the performance of the developed data mining algorithms. Our experimental results show that: 1) while existing research demonstrates that approximate approaches are orders of magnitude faster than exact approaches, for P-FSE mining the efficiency improvement of the approximate approach over the exact approach is marginal; 2) although it has been recognized that the normal distribution based approximation approach is fairly accurate when the data set is large enough, for P-FSE mining the binomial distribution based approximation achieves higher accuracy when the number of episode occurrences is limited; 3) the optimized approach clearly outperforms the other two approaches in terms of runtime, and achieves very high accuracy.
Wan, L., Chen, L. & Zhang, C. 2013, 'Mining Dependent Frequent Serial Episodes from Uncertain Sequence Data', Proceedings of the13th IEEE International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE Computer Society Press, Dallas, TX, USA, pp. 1211-1216.
View/Download from: UTS OPUS or Publisher's site
In this paper, we focus on the problem of mining Probabilistic Dependent Frequent Serial Episodes (P-DFSEs) from uncertain sequence data. By observing that the frequentness probability of an episode in an uncertain sequence is a Markov Chain imbeddable variable, we first propose an Embedded Markov Chain-based algorithm that efficiently computes the frequentness probability of an episode by projecting the probability space into a set of limited partitions. To further improve the computation efficiency, we devise an optimized approach that prunes candidate episodes early by estimating the upper bound of their frequentness probabilities.
Wu, J., Zhu, X., Zhang, C. & Cai, Z. 2013, 'Multi-instance Multi-graph Dual Embedding Learning', 2013 IEEE 13th International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE, Dallas, TX, USA, pp. 827-836.
View/Download from: UTS OPUS or Publisher's site
Multi-instance Multi-graph Dual Embedding Learning
Fang, M., Yin, J., Zhu, X. & Zhang, C. 2013, 'Active Class Discovery and Learning for Networked Data', Website proceedings of 13th SIAM International Conference on Data Mining, SIAM International Conference on Data Mining, SIAM, Austin, Texas, USA, pp. 315-323.
View/Download from: UTS OPUS
Active learning, networked data, class discovery
Liu, C., Chen, L. & Zhang, C. 2013, 'Mining Probabilistic Representative Frequent Patterns From Uncertain Data', The 13th SIAM International Conference on Data Mining (SDM 2013), SIAM International Conference on Data Mining, SIAM / Omnipress, Austin, Texas, USA, pp. 1-9.
View/Download from: UTS OPUS or Publisher's site
Probabilistic frequent pattern mining over uncertain data has received a great deal of attention recently due to the wide applications of uncertain data. Similar to its counterpart in deterministic databases, however, probabilistic frequent pattern mining suffers from the same problem of generating an exponential number of result patterns. The large number of discovered patterns hinders further evaluation and analysis, and calls for finding a small number of representative patterns to approximate all other patterns. This paper formally defines the problem of probabilistic representative frequent pattern (P-RFP) mining, which aims to find the minimal set of patterns with sufficiently high probability to represent all other patterns. The problem's bottleneck turns out to be checking whether a pattern can probabilistically represent another, which involves the computation of a joint probability of supports of two patterns. To address the problem, we propose a novel and efficient dynamic programming-based approach. Moreover, we have devised a set of effective optimization strategies to further improve the computation efficiency. Our experimental results demonstrate that the proposed P-RFP mining effectively reduces the size of probabilistic frequent patterns. Our proposed approach not only discovers the set of P-RFPs efficiently, but also restores the frequency probability information of patterns with an error guarantee.
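The frequentness probability underlying these definitions is easy to state in the simplest setting: if each uncertain record contains a pattern independently with some probability, the probability that the support reaches a threshold follows by dynamic programming over records. The sketch below is generic and assumes that independence; it is not the paper's optimized algorithm, and the numbers are illustrative:

```python
# Frequentness probability under possible-world semantics: record i
# contains the pattern independently with probability p[i]; we want
# P(support >= minsup), computed by dynamic programming over records.

def frequentness_probability(p, minsup):
    dist = [1.0]                     # dist[k] = P(exactly k records so far)
    for pi in p:
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1 - pi)     # pattern absent from this record
            new[k + 1] += prob * pi       # pattern present in this record
        dist = new
    return sum(dist[minsup:])

# Three uncertain records containing the pattern with these probabilities:
prob = frequentness_probability([0.9, 0.8, 0.5], minsup=2)   # ~0.85
```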
Liu, C., Chen, L. & Zhang, C. 2013, 'Summarizing Probabilistic Frequent Patterns: A Fast Approach', Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD13), ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, Chicago, Illinois USA, pp. 527-535.
View/Download from: UTS OPUS or Publisher's site
Mining probabilistic frequent patterns from uncertain data has received a great deal of attention in recent years due to its wide applications. However, probabilistic frequent pattern mining suffers from the problem that an exponential number of result patterns are generated, which seriously hinders further evaluation and analysis. In this paper, we focus on the problem of mining probabilistic representative frequent patterns (P-RFP), the minimal set of patterns with adequately high probability to represent all frequent patterns. Observing the bottleneck in checking whether a pattern can probabilistically represent another, which involves the computation of a joint probability of the supports of two patterns, we introduce a novel approximation of the joint probability with both theoretical and empirical proofs. Based on the approximation, we propose an Approximate P-RFP Mining (APM) algorithm, which effectively and efficiently compresses the set of probabilistic frequent patterns. To our knowledge, this is the first attempt to analyze the relationship between two probabilistic frequent patterns through an approximate approach. Our experiments on both synthetic and real-world datasets demonstrate that the APM algorithm accelerates P-RFP mining dramatically, running orders of magnitude faster than an exact solution. Moreover, the error rate of APM is guaranteed to be very small when the database contains hundreds of transactions, which further affirms that APM is a practical solution for summarizing probabilistic frequent patterns.
Qin, Z., Wang, A.T., Zhang, C. & Zhang, S. 2013, 'Cost-sensitive classification with k-nearest neighbors', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 112-131.
View/Download from: UTS OPUS or Publisher's site
Cost-sensitive learning algorithms are typically motivated by imbalanced data in clinical diagnosis that contain skewed class distributions. While other popular classification methods have been improved to handle imbalanced data, it remains an open problem to extend k-Nearest Neighbors (kNN) classification, one of the top-10 data mining algorithms, to make it cost-sensitive to imbalanced data. To fill this gap, in this paper we study two simple yet effective cost-sensitive kNN classification approaches, called Direct-CS-kNN and Distance-CS-kNN. In addition, we utilize several strategies (i.e., smoothing, minimum-cost k value selection, feature selection and ensemble selection) to improve the performance of Direct-CS-kNN and Distance-CS-kNN. We conduct several groups of experiments on UCI datasets to evaluate their efficiency, and demonstrate that the proposed cost-sensitive kNN classification algorithms can significantly reduce misclassification cost, often by a large margin, as well as consistently outperform CS-4.5 with/without additional enhancements. © 2013 Springer-Verlag Berlin Heidelberg.
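The direct cost-sensitive kNN idea can be sketched in a few lines: estimate class probabilities from the k nearest neighbours, then predict the class with minimum expected misclassification cost rather than the majority class. This is a toy sketch of that general recipe; the training points, cost matrix and function names are illustrative, not from the paper:

```python
import math
from collections import Counter

def cs_knn_predict(train, query, k, cost):
    # cost[i][j] = cost of predicting class i when the true class is j.
    neighbours = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    counts = Counter(label for _, label in neighbours)
    p = {c: counts.get(c, 0) / k for c in cost}            # P(class | kNN)
    expected = {i: sum(cost[i][j] * p[j] for j in cost) for i in cost}
    return min(expected, key=expected.get)                 # min expected cost

train = [((0.0,), 0), ((0.1,), 0), ((0.2,), 0), ((1.0,), 1), ((1.1,), 1)]
# Missing the minority class 1 is ten times as costly as the reverse.
cost = {0: {0: 0.0, 1: 10.0}, 1: {0: 1.0, 1: 0.0}}
pred = cs_knn_predict(train, (0.6,), k=3, cost=cost)
```

For the query `(0.6,)` the plain majority vote among the 3 nearest neighbours favours class 0, but the expected-cost rule predicts the minority class 1, which is exactly the cost-sensitive behaviour the paper targets.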
Liang, G. & Zhang, C. 2012, 'An Efficient and Simple Under-sampling Technique for Imbalanced Time Series Classification', Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM'12, ACM, Maui, Hawaii, USA, pp. 2339-2342.
View/Download from: UTS OPUS
Imbalanced time series classification (TSC), involved in many real-world applications, has increasingly captured the attention of researchers. Previous work has proposed an intelligent structure preserving over-sampling method (SPO), which the authors claimed achieved better performance than other existing over-sampling and state-of-the-art methods in TSC. The main disadvantage of over-sampling methods is that they significantly increase the computational cost of training a classification model due to the addition of new minority class instances to balance data-sets with high dimensional features. These challenging issues have motivated us to find a simple and efficient solution for imbalanced TSC. Statistical tests are applied to validate our conclusions. The experimental results demonstrate that the proposed simple random under-sampling technique with SVM is efficient and can achieve results that compare favorably with the existing complicated SPO method for imbalanced TSC.
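The random under-sampling step itself is simple in spirit: keep every minority instance and an equally sized random subset of the majority class. A minimal sketch follows (illustrative only; the paper pairs this step with an SVM on time-series data, which is omitted here):

```python
import random

def random_undersample(X, y, minority_label, seed=0):
    rng = random.Random(seed)
    minority = [(x, c) for x, c in zip(X, y) if c == minority_label]
    majority = [(x, c) for x, c in zip(X, y) if c != minority_label]
    kept = rng.sample(majority, len(minority))   # drop surplus majority cases
    balanced = minority + kept
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [c for _, c in balanced]

X = list(range(10))                    # stand-ins for time series instances
y = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]    # class 1 is the minority
Xb, yb = random_undersample(X, y, minority_label=1)   # 2 of each class
```

Unlike over-sampling, this shrinks the training set, which is why it avoids the extra training cost the abstract criticizes.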
Su, G., Ying, M. & Zhang, C. 2012, 'Semantic Analysis of Component-aspect Dynamism for Connector-based Architecture Styles', 2012 Joint Working Conference on Software Architecture & 6th European Conference on Software Architecture, Joint Working Conference on Software Architecture & 6th European Conference on Software Architecture, IEEE Computer Society, Helsinki (Finland), pp. 151-160.
View/Download from: UTS OPUS or Publisher's site
Architecture Description Languages usually specify software architectures at the levels of types and instances. Components instantiate component types by parameterization and type conformance. Behavioral analysis of dynamic architectures needs to deal with the uncertainty of actual configurations of components, even if the type-level architectural descriptions are explicitly provided. This paper addresses this verification difficulty for connector-based architecture styles, in which all communication channels of a system are between components and a connector. The contribution of this paper is two-fold: (1) We propose a process-algebraic model, in which the main architectural concepts (such as component type and component conformance) and several fundamental architectural properties (i.e. deadlock-freedom, non-starvation, conservation, and completeness) are formulated. (2) We demonstrate that the state space of verification of these properties can be reduced from the entire universe of possible configurations to specific configurations that are fixed according to the type-level architectural descriptions.
Liang, G. & Zhang, C. 2012, 'A Comparative Study of Sampling Methods and Algorithms for Imbalanced Time Series Classification', Lecture Notes in Computer Science, Springer-Verlag, Sydney, pp. 637-648.
View/Download from: UTS OPUS or Publisher's site
Mining time series data and imbalanced data are two of the ten challenging problems in data mining research. Imbalanced time series classification (ITSC) involves both of these challenging problems, which take place in many real world applications. In the existing research, the structure-preserving over-sampling (SPO) method has been proposed for solving ITSC problems. It is claimed by its authors to achieve better performance than other over-sampling and state-of-the-art methods in time series classification (TSC). However, it is unclear whether an under-sampling method with various learning algorithms is more effective than over-sampling methods, e.g., SPO, for ITSC, because research has shown that under-sampling methods are more effective and efficient than over-sampling methods. We propose a comparative study between an under-sampling method with various learning algorithms and over-sampling methods, e.g., SPO. Statistical tests, i.e., the Friedman test and a post-hoc test, are applied to determine whether there is a statistically significant difference between methods. The experimental results demonstrate that the under-sampling technique with KNN is the most effective method and can achieve results that are superior to the existing complicated SPO method for ITSC.
Long, G., Chen, L., Zhu, X. & Zhang, C. 2012, 'TCSST: transfer classification of short & sparse text using external data', Proc. Of The 21st ACM Conference on Information and Knowledge Management (CIKM-12), ACM Conference on Information and Knowledge Management, ACM, Hawaii, USA, pp. 764-772.
View/Download from: UTS OPUS or Publisher's site
Short & sparse text is becoming more prevalent on the web, such as search snippets, micro-blogs and product reviews. Accurately classifying short & sparse text has emerged as an important yet challenging task. Existing work has considered utilizing external data (e.g. Wikipedia) to alleviate data sparseness, by appending topics detected from external data as new features. However, training a classifier on features concatenated from different spaces is not easy, considering the features have different physical meanings and different significance to the classification task. Moreover, it exacerbates the "curse of dimensionality" problem. In this study, we propose a transfer classification method, TCSST, to exploit external data to tackle the data sparsity issue. The transfer classifier is learned in the original feature space. Considering that the labels of the external data may not be readily available or sufficient, TCSST further exploits the unlabeled external data to aid the transfer classification. We develop novel strategies to allow TCSST to iteratively select high quality unlabeled external data to help with the classification. We evaluate the performance of TCSST on both benchmark and real-world data sets. Our experimental results demonstrate that the proposed method is effective in classifying very short & sparse text, consistently outperforming existing and baseline methods.
Fang, M., Zhu, X. & Zhang, C. 2012, 'Active Learning from Oracle with Knowledge Blind Spot', Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI Press, Toronto, Ontario, Canada, pp. 2421-2422.
View/Download from: UTS OPUS
Active learning traditionally assumes that an oracle is capable of providing labeling information for each query instance. This paper formulates a new research problem which allows an oracle to admit that he/she is incapable of labeling some query instances, or to simply answer "I don't know the label". We define a unified objective function to ensure that each query instance submitted to the oracle is the one most needed for labeling and one the oracle also has the knowledge to label. Experiments based on different types of knowledge blind spot (KBS) models demonstrate the effectiveness of the proposed design.
Li, B., Zhu, X., Chi, L. & Zhang, C. 2012, 'Nested Subtree Hash Kernels for Large-Scale Graph Classification over Streams', 12th IEEE International Conference on Data Mining, ICDM 2012, 2012 IEEE 12th International Conference on Data Mining, IEEE Computer Society, Brussels, Belgium, pp. 399-408.
View/Download from: UTS OPUS or Publisher's site
Most studies on graph classification focus on designing fast and effective kernels. Several fast subtree kernels have achieved a linear time-complexity w.r.t. the number of edges under the condition that a common feature space (e.g., a subtree pattern list) is needed to represent all graphs. This will be infeasible when graphs are presented in a stream with rapidly emerging subtree patterns. In this case, computing a kernel matrix for graphs over the entire stream is difficult since the graphs in the expired chunks cannot be projected onto the unlimitedly expanding feature space again. This creates a serious problem for graph classification over streams: different portions of the stream have different feature spaces. In this paper, we aim to enable large-scale graph classification over streams using the classical ensemble learning framework, which requires the data in different chunks to be in the same feature space. To this end, we propose a Nested Subtree Hashing (NSH) algorithm to recursively project the multi-resolution subtree patterns of different chunks onto a set of common low-dimensional feature spaces. We theoretically analyze the derived NSH kernel and obtain a number of favorable properties: 1) The NSH kernel is an unbiased and highly concentrated estimator of the fast subtree kernel. 2) The bound of convergence rate tends to be tighter as the NSH algorithm steps into a higher resolution. 3) The NSH kernel is robust in tolerating concept drift between chunks over a stream. We also empirically test the NSH kernel on both a large-scale synthetic graph data set and a real-world chemical compounds data set for anticancer activity prediction. The experimental results validate that the NSH kernel is indeed efficient and robust for graph classification over streams.
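The core hashing idea can be illustrated with the plain feature-hashing trick: map arbitrarily many subtree-pattern strings into a fixed low-dimensional vector, so graphs from different stream chunks stay comparable even as new patterns emerge. This is a simplified sketch, not the paper's nested multi-resolution scheme; the pattern strings are made up:

```python
import hashlib

def hash_features(patterns, dim=8):
    # Map each (pattern string -> count) into a fixed dim-sized vector.
    vec = [0] * dim
    for pat, count in patterns.items():
        h = int(hashlib.md5(pat.encode()).hexdigest(), 16)
        sign = 1 if (h >> 1) % 2 == 0 else -1   # sign hash reduces bias
        vec[h % dim] += sign * count
    return vec

g1 = {"A(B,C)": 2, "B(D)": 1}                  # subtree patterns of one graph
g2 = {"A(B,C)": 2, "B(D)": 1, "E(F)": 3}       # later chunk: a new pattern
v1, v2 = hash_features(g1), hash_features(g2)  # same 8-dim space for both
```

Because the dimensionality is fixed in advance, classifiers trained on early chunks can still score graphs from later chunks, which is the property the ensemble framework requires.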
Liang, G., Zhu, X. & Zhang, C. 2011, 'An Empirical Study of Bagging Predictors for Different Learning Algorithms', Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, National Conference of the American Association for Artificial Intelligence, AAAI Press, San Francisco, California, US, pp. 1802-1803.
View/Download from: UTS OPUS
Bagging is a simple yet effective design which combines multiple single learners to form an ensemble for prediction. Despite its popular usage in many real-world applications, existing research is mainly concerned with studying unstable learners as the key to ensuring the performance gain of a bagging predictor, with many key factors remaining unclear. For example, it is not clear when a bagging predictor can outperform a single learner and what the expected performance gain is when different learning algorithms are used to form a bagging predictor. In this paper, we carry out comprehensive empirical studies to evaluate bagging predictors by using 12 different learning algorithms and 48 benchmark data-sets. Our analysis uses robustness and stability decompositions to characterize different learning algorithms, through which we rank all learning algorithms and comparatively study their bagging predictors to draw conclusions. Our studies assert that both stability and robustness are key requirements to ensure the high performance of a bagging predictor. In addition, our studies demonstrate that bagging is statistically superior to most single base learners, except for KNN and Naïve Bayes (NB). Multi-layer Perceptron (MLP), Naïve Bayes Trees (NBTree), and PART are the learning algorithms with the best bagging performance.
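A bagging predictor of the kind studied here trains base learners on bootstrap samples and combines them by majority vote. The sketch below uses a trivial 1-nearest-neighbour rule as the base learner purely for illustration; the data and names are made up, and none of the 12 algorithms from the study are implemented:

```python
import random
from collections import Counter

def one_nn(train):
    # Trivial 1-nearest-neighbour base learner on 1-D inputs.
    def predict(x):
        return min(train, key=lambda xy: abs(xy[0] - x))[1]
    return predict

def bagging(train, n_learners=11, seed=0):
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        boot = [rng.choice(train) for _ in train]   # bootstrap sample
        learners.append(one_nn(boot))
    def predict(x):
        votes = Counter(f(x) for f in learners)     # majority vote
        return votes.most_common(1)[0][0]
    return predict

train = [(0.0, 'a'), (0.2, 'a'), (0.4, 'a'), (1.0, 'b'), (1.1, 'b'), (1.3, 'b')]
model = bagging(train)
```

The stability/robustness question the paper studies is exactly how much this vote over resampled learners gains over a single learner trained on the full set.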
Xiao, Y., Liu, B., Yin, J., Cao, L., Zhang, C. & Hao, Z. 2011, 'Similarity-Based Approach for Positive and Unlabeled Learning', Proceedings of the 22nd International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press, Barcelona, Catalonia, Spain, pp. 1577-1582.
View/Download from: UTS OPUS
Positive and unlabelled learning (PU learning) has been investigated to deal with the situation where only positive examples and unlabelled examples are available. Most previous works focus on identifying some negative examples from the unlabelled data, so that supervised learning methods can be applied to build a classifier. However, the remaining unlabelled data, which cannot be explicitly identified as positive or negative (we call them ambiguous examples), are either excluded from the training phase or simply forced into one of the two classes. Consequently, performance may be constrained. This paper proposes a novel approach, called the similarity-based PU learning (SPUL) method, which associates the ambiguous examples with two similarity weights indicating the similarity of an ambiguous example towards the positive class and the negative class, respectively. Local similarity-based and global similarity-based mechanisms are proposed to generate the similarity weights. The ambiguous examples and their similarity weights are thereafter incorporated into an SVM-based learning phase to build a more accurate classifier. Extensive experiments on real-world datasets have shown that SPUL outperforms state-of-the-art PU learning methods.
Li, B., Zhu, X., Li, R., Zhang, C., Xue, X. & Wu, X. 2011, 'Cross-Domain Collaborative Filtering over Time.', Proceedings of the 22nd International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press, Barcelona, Catalonia, Spain, pp. 2293-2298.
View/Download from: UTS OPUS or Publisher's site
Collaborative filtering (CF) techniques recommend items to users based on their historical ratings. In real-world scenarios, user interests may drift over time since they are affected by moods, contexts, and pop culture trends. This leads to the fact that a user's historical ratings comprise many aspects of user interests spanning a long time period. However, at a certain time slice, one user's interest may only focus on one or a couple of aspects. Thus, CF techniques based on the entire historical ratings may recommend inappropriate items. In this paper, we consider modeling user-interest drift over time based on the assumption that each user has multiple counterparts over temporal domains and successive counterparts are closely related. We adopt the cross-domain CF framework to share the static group-level rating matrix across temporal domains, and let user-interest distribution over item groups drift slightly between successive temporal domains. The derived method is based on a Bayesian latent factor model which can be inferred using Gibbs sampling. Our experimental results show that our method can achieve state-of-the-art recommendation performance as well as explicitly track and visualize user-interest drift over time.
Luo, C., Zhao, Y., Luo, D., Zhang, C. & Cao, W. 2011, 'Agent-Based Subspace Clustering', Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, PAKDD, Springer-Verlag, Shenzhen, China, pp. 370-381.
View/Download from: UTS OPUS or Publisher's site
This paper presents an agent-based algorithm for discovering subspace clusters in high dimensional data. Each data object is represented by an agent, and the agents move from one local environment to another to find optimal clusters in subspaces. Heuristic rules and objective functions are defined to guide the movements of agents, so that similar agents (data objects) go to one group. The experimental results show that our proposed agent-based subspace clustering algorithm performs better than existing subspace clustering methods on both the F1 measure and Entropy. The running time of our algorithm is scalable with the size and dimensionality of the data. Furthermore, an application in stock market surveillance demonstrates its effectiveness in real world applications.
Chen, L. & Zhang, C. 2011, 'Semi-supervised Variable Weighting for Clustering', Proceedings of the Eleventh SIAM International Conference on Data Mining, SDM, SIAM / Omnipress, Mesa, Arizona, USA, pp. 863-871.
View/Download from: UTS OPUS
Semi-supervised learning, which uses a small amount of labeled data in conjunction with a large amount of unlabeled data for training, has recently attracted huge research attention due to the considerable improvement in learning accuracy. In this work, we focus on semi-supervised variable weighting for clustering, which is a critical step in clustering as it is known that interesting clustering structure usually occurs in a subspace defined by a subset of variables. Besides exploiting both labeled and unlabeled data to effectively identify the real importance of variables, our method embeds variable weighting in the process of semi-supervised clustering, rather than calculating variable weights separately, to ensure computational efficiency. Our experiments carried out on both synthetic and real data demonstrate that semi-supervised variable weighting significantly improves the clustering accuracy of existing semi-supervised k-means without variable weighting, or with unsupervised variable weighting.
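Variable weighting changes only the distance used for cluster assignment. A toy sketch shows why it matters: the weights below are fixed by hand purely for illustration, whereas the paper learns them from labeled and unlabeled data inside semi-supervised k-means:

```python
def weighted_assign(point, centroids, w):
    # Assign a point to the centroid minimizing the weighted squared distance.
    def dist(c):
        return sum(wi * (pi - ci) ** 2 for wi, pi, ci in zip(w, point, c))
    return min(range(len(centroids)), key=lambda k: dist(centroids[k]))

centroids = [(0.0, 0.0), (1.0, 1.0)]
point = (0.2, 0.9)   # variable 0 says cluster 0; noisy variable 1 says cluster 1
equal = weighted_assign(point, centroids, (0.5, 0.5))      # misled by the noise
weighted = weighted_assign(point, centroids, (0.9, 0.1))   # follows variable 0
```

With equal weights the noisy variable dominates and the point lands in cluster 1; down-weighting it flips the assignment to cluster 0, which is the effect a learned weight vector aims for.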
Liang, G., Zhu, X. & Zhang, C. 2011, 'An Empirical Study of Bagging Predictors for Imbalanced Data with Different Levels of Class Distribution', AI 2011: Advances in Artificial Intelligence, Australasian Joint Conference on Artificial Intelligence, Springer-Verlag Berlin / Heidelberg, Perth, Australia, pp. 213-222.
View/Download from: UTS OPUS or Publisher's site
Research into learning from imbalanced data has increasingly captured the attention of both academia and industry, especially when the class distribution is highly skewed. This paper compares the Area Under the Receiver Operating Characteristic Curve (AUC) performance of bagging in the context of learning from different imbalance levels of class distribution. Despite the popularity of bagging in many real-world applications, some questions have not been clearly answered in the existing research, e.g., which bagging predictors may achieve the best performance for applications, and whether bagging is superior to single learners when the levels of class distribution change. We perform a comprehensive evaluation of the AUC performance of bagging predictors with 12 base learners at different imbalance levels of class distribution by using a sampling technique on 14 imbalanced data-sets. Our experimental results indicate that Decision Table (DTable) and RepTree are the learning algorithms with the best bagging AUC performance. Most AUC performances of bagging predictors are statistically superior to single learners, except for Support Vector Machines (SVM) and Decision Stump (DStump).
Liang, G. & Zhang, C. 2011, 'An Empirical Evaluation of Bagging with Different Algorithms on Imbalanced Data', Advanced Data Mining and Applications. Lecture Notes in Artificial Intelligence 7120, International Conference on Advanced Data Mining and Applications, Springer-Verlag Berlin / Heidelberg, Beijing, China, pp. 339-352.
View/Download from: UTS OPUS or Publisher's site
This study investigates the effectiveness of bagging with respect to different learning algorithms on imbalanced data-sets. The purpose of this research is to investigate the performance of bagging based on two unique approaches: (1) classify base learners with respect to 12 different learning algorithms in general terms, and (2) evaluate the performance of bagging predictors on data with imbalanced class distributions. The former approach develops a method to categorize base learners by using two-dimensional robustness and stability decomposition on 48 benchmark data-sets; while the latter approach investigates the performance of bagging predictors by using evaluation metrics, True Positive Rate (TPR), Geometric mean (G-mean) for the accuracy on the majority and minority classes, and the Receiver Operating Characteristic (ROC) curve on 12 imbalanced data-sets. Our studies assert that both stability and robustness are important factors for building high performance bagging predictors on data with imbalanced class distributions. The experimental results demonstrate that PART and Multi-layer Perceptron (MLP) are the learning algorithms with the best bagging performance on 12 imbalanced data-sets. Moreover, only four out of 12 bagging predictors are statistically superior to single learners based on both G-mean and TPR evaluation metrics over 12 imbalanced data-sets.
Dong, X., Zheng, Z., Cao, L., Zhao, Y., Zhang, C., Li, J., Wei, W. & Ou, Y. 2011, 'e-NSP: efficient negative sequential pattern mining based on identified positive patterns without database rescanning', Proceedings of the 20th ACM International Conference on Information and Knowledge Management, ACM international conference on Information and knowledge management, ACM, Glasgow, Scotland, UK, pp. 825-830.
View/Download from: UTS OPUS
Mining Negative Sequential Patterns (NSP) is much more challenging than mining Positive Sequential Patterns (PSP) due to the high computational complexity and huge search space required in calculating Negative Sequential Candidates (NSC). Very few approaches are available for mining NSP, and they mainly rely on re-scanning databases after identifying PSP. As a result, they are very inefficient. In this paper, we propose an efficient algorithm for mining NSP, called e-NSP, which mines NSP by only involving the identified PSP, without re-scanning databases. First, negative containment is defined to determine whether or not a data sequence contains a negative sequence. Second, an efficient approach is proposed to convert the negative containment problem to a positive containment problem. The supports of NSC are then calculated based only on the corresponding PSP. Finally, a simple but efficient approach is proposed to generate NSC. With e-NSP, mining NSP does not require additional database scans, and existing PSP mining algorithms can be integrated into e-NSP to mine NSP efficiently. e-NSP is compared with two currently available NSP mining algorithms on 14 synthetic and real-life datasets. Intensive experiments show that e-NSP takes as little as 3% of the runtime of the baseline approaches and is applicable for efficient mining of NSP in large datasets.
Li, J., Bian, W., Tao, D. & Zhang, C. 2011, 'Learning Colours from Textures by Sparse Manifold Embedding', Lecture Notes in Artificial Intelligence.AI 2011: Advances in Artificial Intelligence.24th Australasian Joint Conference, AI 2011: Advances in Artificial Intelligence.24th Australasian Joint Conference, Springer-Verlag Berlin / Heidelberg, Perth, Australia, pp. 600-608.
View/Download from: UTS OPUS or Publisher's site
The capability of inferring colours from the texture (grayscale content) of an image is useful in many application areas where the imaging device/environment is limited. Traditional colour assignment involves intensive human effort. Automatic methods have been proposed to establish relations between image textures and the corresponding colours. Existing research mainly focuses on linear relations. In this paper, we employ sparse constraints in the model of the texture-colour relationship. The technique is developed on a locally linear model, which rests on a manifold assumption about the distribution of the image data. Given the texture of an image patch, the learned model transfers colours to the texture patch by combining known colours of similar texture patches. The sparse constraint checks the contributing factors in the model and helps improve the stability of the colour transfer. Experiments show that our method gives superior results to those of previous work.
Fu, Y., Li, B., Zhu, X. & Zhang, C. 2011, 'Do they belong to the same class: active learning by querying pairwise label homogeneity', CIKM '11 Proceedings of the 20th ACM international conference on Information and knowledge management, ACM international conference on Information and knowledge management, ACM, Glasgow, Scotland, pp. 2161-2164.
View/Download from: UTS OPUS or Publisher's site
Traditional active learning methods request experts to provide ground truths for the queried instances, which can be expensive in practice. An alternative solution is to ask nonexpert labelers to do such labeling work, although they cannot provide definite class labels. In this paper, we propose a new active learning paradigm, in which a nonexpert labeler is only asked "whether a pair of instances belong to the same class". To instantiate the proposed paradigm, we adopt the MinCut algorithm as the base classifier. We first construct a graph based on the pairwise distances of all the labeled and unlabeled instances, and then repeatedly update the unlabeled edge weights on the max-flow paths in the graph. Finally, we select an unlabeled subset of nodes with the highest prediction confidence, which are included in the labeled data set to learn a new classifier for the next round of active learning. The experimental results and comparisons with state-of-the-art methods demonstrate that our active learning paradigm can achieve good performance with nonexpert labelers.
Liang, G. & Zhang, C. 2011, 'Empirical Study of Bagging Predictors on Medical Data', Volume 121 - Ninth Australasian Data Mining Conference, Australasian Data Mining Conference, ACS, Ballarat, Australia, pp. 31-40.
View/Download from: UTS OPUS
This study investigates the performance of bagging in terms of learning from imbalanced medical data. It is important for data miners to achieve highly accurate prediction models, and this is especially true for imbalanced medical applications. In these situations, practitioners are more interested in the minority class than the majority class; however, it is hard for a traditional supervised learning algorithm to achieve a highly accurate prediction on the minority class, even though it might achieve better results according to the most commonly used evaluation metric, Accuracy. Bagging is a simple yet effective ensemble method which has been applied to many real-world applications. However, some questions have not been well answered, e.g., whether bagging outperforms single learners on medical data-sets; which learners are the best predictors for each medical data-set; and what is the best predictive performance achievable for each medical data-set when we apply sampling techniques. We perform an extensive empirical study on the performance of 12 learning algorithms on 8 medical data-sets based on four performance measures: True Positive Rate (TPR), True Negative Rate (TNR), Geometric Mean (G-mean) of the accuracy rate of the majority class and the minority class, and Accuracy as evaluation metrics. In addition, the statistical analyses performed instil confidence in the validity of the conclusions of this research.
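As a rough illustration of the bagging scheme evaluated in these studies, the sketch below trains base learners on bootstrap resamples and predicts by majority vote; `base_fit` and the toy 1-NN learner are hypothetical stand-ins for the paper's 12 learning algorithms:

```python
import random
from collections import Counter

def bagging_predict(train, test_point, base_fit, n_bags=25, seed=0):
    """Standard bagging: train n_bags base learners on bootstrap
    resamples of `train` and predict by majority vote.
    `base_fit` maps a training set to a classifier function."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    votes = []
    for _ in range(n_bags):
        boot = [rng.choice(train) for _ in train]  # sample with replacement
        votes.append(base_fit(boot)(test_point))
    return Counter(votes).most_common(1)[0][0]

def fit_1nn(data):
    """Toy base learner: 1-nearest neighbour on one numeric feature.
    `data` is a list of (value, label) pairs."""
    return lambda x: min(data, key=lambda d: abs(d[0] - x))[1]
```

The variance reduction from averaging many bootstrap-trained models is what the paper's robustness/stability analysis tries to explain per base learner.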
Yang, T., Cao, L. & Zhang, C. 2010, 'A Novel Prototype Reduction Method for the K-Nearest Neighbor Algorithm with K >= 1', Advances in Knowledge Discovery and Data Mining - Lecture Notes in Artificial Intelligence, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer Berlin / Heidelberg, Hyderabad, India, pp. 89-100.
View/Download from: UTS OPUS or Publisher's site
In this paper, a novel prototype reduction algorithm is proposed, which aims at reducing the storage requirement and enhancing online speed while retaining the same level of accuracy for a K-nearest neighbor (KNN) classifier. To achieve this goal, our proposed algorithm learns a weighted similarity function for a KNN classifier by maximizing the leave-one-out cross-validation accuracy. Unlike the classical methods PW, LPD and WDNN, which can only work with K=1, our algorithm can work with K>=1. This flexibility allows our learning algorithm to have superior classification accuracy and noise robustness. The proposed approach is assessed through experiments with twenty real-world benchmark data sets. In all these experiments, the proposed approach dramatically reduces the storage requirement and online time of KNN while having equal or better accuracy than KNN, and it also shows results comparable to several prototype reduction methods proposed in the literature.
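The leave-one-out criterion the algorithm maximizes can be sketched as follows; this toy version scores a plain (unweighted) Euclidean KNN rather than the paper's learned weighted similarity:

```python
from collections import Counter

def loo_accuracy(X, y, k=3):
    """Leave-one-out cross-validation accuracy of a plain KNN classifier:
    each point is classified by its k nearest neighbours among the
    remaining points, and the fraction classified correctly is returned."""
    correct = 0
    for i in range(len(X)):
        dists = sorted((sum((a - b) ** 2 for a, b in zip(X[i], X[j])), y[j])
                       for j in range(len(X)) if j != i)
        votes = Counter(label for _, label in dists[:k])
        correct += votes.most_common(1)[0][0] == y[i]
    return correct / len(X)
```

Maximizing this quantity over similarity weights (as the paper does) directly targets generalization rather than training-set fit.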
Yang, T., Kecman, V., Cao, L. & Zhang, C. 2010, 'Combining Support Vector Machines and the t-statistic for Gene Selection in DNA Microarray Data Analysis', Advances in Knowledge Discovery and Data Mining - Lecture Notes in Artificial Intelligence, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer Berlin / Heidelberg, Hyderabad, India, pp. 55-62.
View/Download from: UTS OPUS or Publisher's site
This paper proposes a new gene selection (or feature selection) method for DNA microarray data analysis. In the method, the t-statistic and support vector machines are combined efficiently. The resulting gene selection method uses both the data intrinsic information and learning algorithm performance to measure the relevance of a gene in a DNA microarray. We explain why and how the proposed method works well. The experimental results on two benchmarking microarray data sets show that the proposed method is competitive with previous methods. The proposed method can also be used for other feature selection problems.
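A minimal sketch of the t-statistic half of such a gene filter (the SVM-based refinement is omitted); the data layout, one row per gene, is an assumption for illustration:

```python
import math

def t_statistic(feature, labels):
    """Two-sample t-statistic of one gene's expression values between
    the two classes (larger |t| = more discriminative gene)."""
    a = [x for x, y in zip(feature, labels) if y == 0]
    b = [x for x, y in zip(feature, labels) if y == 1]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def rank_genes(matrix, labels, top_k):
    """Return indices of the top_k genes by |t|; a learning-algorithm
    criterion (here, an SVM's) would then refine this pre-filtered set."""
    scored = sorted(range(len(matrix)),
                    key=lambda g: -abs(t_statistic(matrix[g], labels)))
    return scored[:top_k]
```

This is the "data intrinsic information" side of the combination; the SVM contributes the "learning algorithm performance" side.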
Qin, Z., Zhang, C., Wang, T. & Zhang, S. 2010, 'Cost Sensitive Classification in Data Mining', Advanced Data Mining and Applications - 6th International Conference, ADMA 2010, International Conference on Advanced Data Mining and Applications, Springer-Verlag, Chongqing, China, pp. 1-11.
View/Download from: UTS OPUS or Publisher's site
Cost-sensitive classification is one of the mainstream research topics in data mining and machine learning; it induces models from data with imbalanced class distributions and handles the imbalance by quantifying and tackling its impact. Rooted in diagnosis data analysis applications, a great many techniques have been developed for cost-sensitive learning. They mainly focus on minimizing the total of misclassification costs, test costs, or other types of cost, or a combination of these costs. This paper introduces the up-to-date prevailing cost-sensitive learning methods and presents some research topics by outlining our two new results: lazy-learning and semi-learning strategies for cost-sensitive classifiers.
Yang, T., Kecman, V., Cao, L. & Zhang, C. 2010, 'Testing Adaptive Local Hyperplane for multi-class classification by double cross-validation', The 2010 International Joint Conference on Neural Networks (IJCNN), International Joint Conference on Neural Networks, IEEE, Barcelona, Spain, pp. 1-5.
View/Download from: UTS OPUS or Publisher's site
Adaptive Local Hyperplane (ALH) is a recently proposed classifier for multi-class classification problems and has shown encouraging performance in many pattern recognition problems. However, ALH's performance over many general classification datasets has only been tested with a single loop of cross-validation, where the whole dataset is used for both hyper-parameter determination and accuracy estimation. This procedure is appropriate for classifier performance comparison, but the resulting estimates are likely to be optimistic for classifier accuracy on new datasets. In this paper, we test the performance of ALH, as well as several other benchmark classifiers, using a two-loop cross-validation (a.k.a. double resampling) procedure, where the inner loop is used for hyper-parameter determination and the outer loop for accuracy estimation. With such a testing scheme, the classification accuracy of a tested classifier can be evaluated more strictly. The experimental results indicate the superior performance of the ALH classifier with respect to traditional classifiers including the Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Linear Discriminant Analysis (LDA), Classification Tree (Tree) and K-local Hyperplane distance Nearest Neighbor (HKNN). These results imply that the ALH classifier might become a useful tool for pattern recognition tasks.
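The two-loop testing scheme can be sketched in a few lines; the contiguous fold construction and the toy threshold learner in the test are illustrative assumptions, not the paper's setup:

```python
def k_folds(n, k):
    """Split indices 0..n-1 into k contiguous folds (sketch only;
    real experiments would shuffle and/or stratify first)."""
    size, folds, start = n // k, [], 0
    for i in range(k):
        extra = 1 if i < n % k else 0
        folds.append(list(range(start, start + size + extra)))
        start += size + extra
    return folds

def double_cv(X, y, fit, params, k_outer=3, k_inner=3):
    """Nested (double) cross-validation: the inner loop picks a
    hyper-parameter, the outer loop estimates accuracy with it.
    `fit(param, train_X, train_y)` must return a predict function."""
    def cv_score(idx, param, k):
        correct = 0
        for fold in k_folds(len(idx), k):
            test = [idx[i] for i in fold]
            train = [j for j in idx if j not in test]
            clf = fit(param, [X[j] for j in train], [y[j] for j in train])
            correct += sum(clf(X[j]) == y[j] for j in test)
        return correct / len(idx)

    all_idx, outer_acc = list(range(len(X))), []
    for fold in k_folds(len(X), k_outer):
        test = [all_idx[i] for i in fold]
        train = [j for j in all_idx if j not in test]
        best = max(params, key=lambda p: cv_score(train, p, k_inner))  # inner loop
        clf = fit(best, [X[j] for j in train], [y[j] for j in train])
        outer_acc.append(sum(clf(X[j]) == y[j] for j in test) / len(test))
    return sum(outer_acc) / len(outer_acc)
```

Because the outer test folds never influence hyper-parameter choice, the returned accuracy avoids the optimistic bias the abstract describes.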
Cao, L., Luo, D. & Zhang, C. 2009, 'Ubiquitous Intelligence in Agent Mining', ADMI 2009, International Workshop on Agents and Data Mining Interaction, Springer, Budapest, Hungary, pp. 23-35.
View/Download from: UTS OPUS or Publisher's site
Agent mining, namely the interaction and integration of multi-agent systems and data mining, has emerged as a very promising research area. While many mutual issues exist in both the multi-agent and data mining areas, most of them can be described in terms of, or related to, ubiquitous intelligence. It is certainly very important to define, specify, represent, analyze and utilize ubiquitous intelligence in agents, data mining, and agent mining. This paper presents a novel but preliminary investigation of ubiquitous intelligence in these areas. We specify five types of ubiquitous intelligence: data intelligence, human intelligence, domain intelligence, network and web intelligence, and organizational and social intelligence. We define and illustrate them, and discuss techniques for involving them in agents, data mining, and agent mining for complex problem-solving. Further investigation on involving and synthesizing ubiquitous intelligence into agents, data mining, and agent mining will lead to a disciplinary upgrade from methodological, technical and practical perspectives.
Zhao, Y., Zhang, H., Wu, S., Pei, J., Cao, L., Zhang, C. & Bohlscheid, H. 2009, 'Debt Detection in Social Security by Sequence Classification Using Both Positive and Negative Patterns', Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2009, European Conference on Machine Learning, Springer, Bled, Slovenia, pp. 648-663.
View/Download from: UTS OPUS or Publisher's site
Debt detection is important for improving payment accuracy in social security. Since debt detection from customer transactional data can be generally modelled as a fraud detection problem, a straightforward solution is to extract features from transaction sequences and build a sequence classifier for debts. The existing sequence classification methods based on sequential patterns consider only positive patterns. However, according to our experience in a large social security application, negative patterns are very useful in accurate debt detection. In this paper, we present a successful case study of debt detection in a large social security application. The central technique is building sequence classification using both positive and negative sequential patterns.
Zhao, Y., Zhang, H., Cao, L., Zhang, C. & Bohlscheid, H. 2009, 'Mining Both Positive and Negative Impact-Oriented Sequential Rules from Transactional Data', Advances in Knowledge Discovery and Data Mining, 13th Pacific-Asia Conference, PAKDD 2009, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Bangkok, Thailand, pp. 656-663.
View/Download from: UTS OPUS
Traditional sequential pattern mining deals with positive correlation between sequential patterns only, without considering negative relationship between them. In this paper, we present a notion of impact-oriented negative sequential rules, in which the left side is a positive sequential pattern or its negation, and the right side is a predefined outcome or its negation. Impact-oriented negative sequential rules are formally defined to show the impact of sequential patterns on the outcome, and an efficient algorithm is designed to discover both positive and negative impact-oriented sequential rules. Experimental results on both synthetic data and real-life data show the efficiency and effectiveness of the proposed technique.
Zhu, X., Wu, X. & Zhang, C. 2009, 'Vague One-Class Learning for Data Streams', Proceedings of the 9th IEEE International Conference on Data Mining (ICDM-09), IEEE International Conference on Data Mining, IEEE Computer Society, Miami, Florida, pp. 657-666.
View/Download from: UTS OPUS or Publisher's site
In this paper, we formulate a new research problem of learning from vaguely labeled one-class data streams, where the main objective is to allow users to label instance groups, instead of single instances, as positive samples for learning. The batch-labeling, however, raises serious issues because labeled groups may contain non-positive samples, and users may change their labeling interests at any time. To solve this problem, we propose a Vague One-Class Learning (VOCL) framework which employs a double weighting approach, at both instance and classifier levels, to build an ensembling framework for learning. At instance level, both local and global filterings are considered for instance weight adjustment. Two solutions are proposed to take instance weight values into the classifier training process. At classifier level, a weight value is assigned to each classifier of the ensemble to ensure that learning can quickly adapt to users' interests. Experimental results on synthetic and real-world data streams demonstrate that the proposed VOCL framework significantly outperforms other methods for vaguely labeled one-class data streams.
Wang, W., Xiao, C., Lin, X. & Zhang, C. 2009, 'Efficient approximate entity extraction with edit distance constraints', Proceedings of the 35th SIGMOD international conference on Management of data, ACM Special Interest Group on Management of Data Conference, ACM, Rhode Island, USA, pp. 759-770.
View/Download from: UTS OPUS
Named entity recognition aims at extracting named entities from unstructured text. A recent trend of named entity recognition is finding approximate matches in the text with respect to a large dictionary of known entities, as the domain knowledge encoded in the dictionary helps to improve the extraction performance. In this paper, we study the problem of approximate dictionary matching with edit distance constraints. Compared to existing studies using token-based similarity constraints, our problem definition enables us to capture typographical or orthographical errors, both of which are common in entity extraction tasks yet may be missed by token-based similarity constraints. Our problem is technically challenging as existing approaches based on q-gram filtering have poor performance due to the existence of many short entities in the dictionary. Our proposed solution is based on an improved neighborhood generation method employing novel partitioning and prefix pruning techniques. We also propose an efficient document processing algorithm that minimizes unnecessary comparisons and enumerations and hence achieves good scalability. We have conducted extensive experiments on several publicly available named entity recognition datasets. The proposed algorithm outperforms alternative approaches by up to an order of magnitude.
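The core primitive here, testing whether two strings are within edit distance k, can be sketched with a banded dynamic program (cells further than k from the diagonal cannot lie on a path of cost <= k); this is a textbook version, not the paper's optimized neighborhood-generation algorithm:

```python
def within_edit_distance(s, t, k):
    """Return True iff Levenshtein distance(s, t) <= k, filling only a
    band of width 2k+1 around the diagonal of the DP table."""
    if abs(len(s) - len(t)) > k:
        return False                      # length filter: cannot match
    INF = k + 1                           # any value > k is "too far"
    prev = [j if j <= k else INF for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [INF] * (len(t) + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(len(t), i + k) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost,  # match / substitution
                         INF)                 # cap values outside interest
        prev = cur
    return prev[len(t)] <= k
```

Restricting to the band drops the cost from O(|s||t|) to O(k * min(|s|, |t|)), which matters when matching a document against a large dictionary.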
Zhang, C. 2009, 'Developing Actionable Trading Strategies for Trading Agents', Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM international Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society, Milano, Italy, pp. 9-9.
View/Download from: Publisher's site
Trading agents are useful for developing and back-testing quality trading strategies for taking actions in the real world. Existing trading agent research mainly focuses on simulation using artificial data. As a result, the actionable capability of the developed trading strategies is often limited, and the trading agents therefore lack power. Actionable trading strategies can empower trading agents with workable decision-making in real-life markets. The development of actionable strategies is a non-trivial task which needs to consider real-life constraints and organisational factors in the market. In this talk, we first analyse such constraints on developing actionable trading strategies for trading agents and propose a trading strategy development framework for trading agents. We then develop a series of trading strategies for trading agents through optimising, enhancing and discovering actionable trading strategies. We demonstrate working case studies using agent mining technology on real market data. These approaches, and their performance, are evaluated from both technical and business perspectives. These evaluations clearly show that the development of trading strategies for trading agents, using our approach, can lead to smart decisions for brokerage firms and financial companies.
Xiao, Y., Liu, B., Cao, L., Wu, X., Zhang, C., Hao, Z., Yang, F. & Cao, J. 2009, 'Multi-sphere Support Vector Data for Outliers Detection on Multi-distribution Data', Data Mining Workshops, 2009. ICDMW '09. IEEE International Conference on, IEEE International Conference on Data Mining, IEEE Computer Society Press, Miami, Florida, pp. 82-87.
View/Download from: UTS OPUS or Publisher's site
SVDD has been proven a powerful tool for outlier detection. However, in detecting outliers in multi-distribution data, that is, data containing several distinct distributions, it is very challenging for SVDD to generate a single hyper-sphere that distinguishes outliers from normal data. Even if such a hyper-sphere can be identified, its performance is usually not good enough. This paper proposes a multi-sphere SVDD approach, named MS-SVDD, for outlier detection on multi-distribution data. First, an adaptive sphere detection method is proposed to detect the data distributions in the dataset. The data is partitioned according to the identified distributions, and the corresponding SVDD classifiers are constructed separately. Substantial experiments on both artificial and real-world datasets demonstrate that the proposed approach outperforms the original SVDD.
Zhao, Y., Zhang, H., Cao, L., Zhang, C. & Bohlscheid, H. 2008, 'Combined Pattern Mining: from Learned Rules to Actionable Knowledge', AI 2008: Advances in Artificial Intelligence: Lecture Notes in Artificial Intelligence 5360, Australasian Joint Conference on Artificial Intelligence, Springer, Auckland, New Zealand, pp. 393-403.
View/Download from: UTS OPUS or Publisher's site
Association mining often produces large collections of association rules that are difficult to understand and put into action. In this paper, we have designed a novel notion of combined patterns to extract useful and actionable knowledge from a large amount of learned rules. We also present definitions of combined patterns, design novel metrics to measure their interestingness and analyze the redundancy in combined patterns. Experimental results on real-life social security data demonstrate the effectiveness and potential of the proposed approach in extracting actionable knowledge from complex data.
Moemeng, C., Cao, L. & Zhang, C. 2008, 'F-TRADE 3.0: An Agent-Based Integrated Framework for Data Mining Experiments', 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Workshops, IEEE/WIC/ACM international Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society, University of Technology, Sydney, Australia, pp. 612-615.
View/Download from: UTS OPUS or Publisher's site
Data mining research focuses on algorithms that mine valuable patterns from a particular domain. Beyond the theoretical research, experiments take a vast amount of effort to build. In this paper, we propose an integrated framework that utilises a multi-agent system to help researchers rapidly develop experiments. Moreover, the proposed framework allows extension and integration for future research on mutual aspects of agents and data mining. The paper describes the details of the framework and also presents a sample implementation.
Zhao, Y., Zhang, H., Cao, L., Zhang, C. & Bohlscheid, H. 2008, 'Efficient Mining of Event-Oriented Negative Sequential Rules', 2008 IEEE/WIC/ACM International Conference on Web Intelligence, IEEE/WIC/ACM international Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society, University of Technology, Sydney, Australia, pp. 336-342.
View/Download from: UTS OPUS or Publisher's site
Traditional sequential pattern mining deals with positive sequential patterns only, that is, only frequent sequential patterns with the appearance of items are discovered. However, it is often interesting in many applications to find frequent sequential patterns with the non-occurrence of some items, which are referred to as negative sequential patterns. This paper analyzes three types of negative sequential rules and presents a new technique to find event-oriented negative sequential rules. Its effectiveness and efficiency are shown in our experiments.
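One simple reading of an event-oriented negative rule, "pattern occurs but a given item never does", can be sketched as follows; the paper defines three richer rule types, so treat this semantics as an illustrative assumption:

```python
def is_subsequence(pattern, seq):
    """True if pattern's items appear in seq in order
    (standard positive sequence containment)."""
    it = iter(seq)
    return all(item in it for item in pattern)  # `in` advances the iterator

def negative_rule_support(db, pos, absent):
    """Fraction of sequences containing `pos` as a subsequence while the
    item `absent` never occurs -- one simple negative-pattern semantics."""
    hits = sum(1 for s in db if is_subsequence(pos, s) and absent not in s)
    return hits / len(db)
```

The practical difficulty the paper addresses is that candidates involving non-occurrence explode combinatorially, so naive enumeration like this does not scale.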
Liu, B., Cao, L., Yu, P. & Zhang, C. 2008, 'Multi-Space-Mapped SVMs for Multi-Class Classification', 2008 Eighth IEEE International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE, Pisa, Italy, pp. 911-916.
View/Download from: UTS OPUS or Publisher's site
In SVM-based multi-class classification, it is not always possible to find a kernel function that maps all the classes, drawn from different distributions, into a feature space where they are linearly separable from each other. This is even worse when the number of classes is very large. As a result, the classification accuracy is not as good as expected. To improve the performance of SVM-based multi-class classifiers, this paper proposes a method, named multi-space-mapped SVMs, to map the classes into different feature spaces and then classify them. The proposed method reduces the requirements on the kernel function. Substantial experiments have been conducted on the One-against-All, One-against-One, FSVM and DDAG algorithms and our algorithm using six UCI data sets. The statistical results show that the proposed method has a higher probability of finding appropriate kernel functions than traditional methods and outperforms the others.
Zhang, H., Zhao, Y., Cao, L. & Zhang, C. 2008, 'Combined Association Rule Mining', Lecture Notes in Artificial Intelligence Vol 5012: Advances in Knowledge Discovery and Data Mining, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Osaka, Japan, pp. 1069-1074.
View/Download from: UTS OPUS or Publisher's site
This paper proposes an algorithm to discover a novel kind of association rule: combined association rules. Compared with conventional association rules, combined association rules allow users to perform actions directly. Combined association rules are organized as rule sets, each of which is composed of a number of single combined association rules. These single rules consist of non-actionable attributes, actionable attributes, and a class attribute, with the rules in one set sharing the same non-actionable attributes. Thus, for a group of objects having the same non-actionable attributes, the actions corresponding to a preferred class can be performed directly. However, standard association rule mining algorithms encounter many difficulties when applied to combined association rule mining, and hence new algorithms have to be developed. In this paper, we focus on rule generation and interestingness measures in combined association rule mining. In rule generation, the frequent itemsets are discovered among itemset groups to improve efficiency. New interestingness measures are defined to discover more actionable knowledge. In the case study, the proposed algorithm is applied to the field of social security. The combined association rules provide much more actionable knowledge to business owners and users.
Ou, Y., Cao, L., Luo, C. & Zhang, C. 2008, 'Domain-Driven Local Exceptional Pattern Mining for Detecting Stock Price Manipulation', Lecture Notes in Computer Science Vol 5351: PRICAI 2008: Trends in Artificial Intelligence, Pacific Rim International Conference on Artificial Intelligence, Springer, Hanoi,Vietnam, pp. 849-858.
View/Download from: UTS OPUS or Publisher's site
Recently, a new data mining methodology, Domain Driven Data Mining (D3M), has been developed. On top of data-centered pattern mining, D3M generally targets actionable knowledge discovery under domain-specific circumstances. It strongly encourages the involvement of domain intelligence in the whole process of data mining, and consequently leads to deliverables that can satisfy business user needs and decision-making. Following the methodology of D3M, this paper investigates local exceptional patterns in real-life microstructure stock data for detecting stock price manipulations. Different from existing pattern analysis, which deals mainly with interday data, we deal with tick-by-tick data. Our approach proposes new mechanisms for constructing microstructure order sequences by involving domain factors and business logic, and for measuring the interestingness of patterns from a business-concern perspective. Experiments on real-life exchange data demonstrate that the outcomes generated by following D3M can satisfy business expectations and support business users in taking actions for market surveillance.
Zhu, X., Zhang, P., Wu, X., He, D., Zhang, C. & Shi, Y. 2008, 'Cleansing Noisy Data Streams', 2008 Eighth IEEE International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE Computer Society, Pisa, Italy, pp. 1139-1144.
View/Download from: UTS OPUS or Publisher's site
We identify a new research problem on cleansing noisy data streams which contain incorrectly labeled training examples. The objective is to accurately identify and remove mislabeled data, such that the prediction models built from the cleansed streams can be more accurate than the ones trained from the raw noisy streams. For this purpose, we first use bias-variance decomposition to derive a maximum variance margin (MVM) principle for stream data cleansing. Following this principle, we further propose a local and global filtering (LgF) framework to combine the strength of local noise filtering (within one single data chunk) and global noise filtering (across a number of adjacent data chunks) to identify erroneous data. Experimental results on six data streams (including two real-world data streams) demonstrate that LgF significantly outperforms simple methods in identifying noisy examples.
Luo, C., Zhao, Y., Cao, L., Ou, Y. & Zhang, C. 2008, 'Exception Mining on Multiple Time Series in Stock Market', 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM international Conference on Web Intelligence and Intelligent Agent Technology, Springer, Sydney, Australia, pp. 690-693.
View/Download from: UTS OPUS or Publisher's site
This paper presents our research on exception mining on multiple time series data, which aims to assist stock market surveillance by identifying market anomalies. Traditional technologies for stock market surveillance have shown their limitations in handling large amounts of complicated stock market data. In our research, Outlier Mining on Multiple time series (OMM) is proposed to improve the effectiveness of exception detection for stock market surveillance. The idea of our research is presented, challenges are analyzed, and potential research directions are summarized.
Chen, Q., Zhang, C. & Zhang, S. 2008, 'Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Introduction', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 1-231.
View/Download from: Publisher's site
Zhang, C., Zhu, X., Zhang, J., Qin, Y. & Zhang, S. 2007, 'GBKII: An Imputation Method for Missing Values', Advances in Knowledge Discovery and Data Mining, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Nanjing, China, pp. 1080-1087.
View/Download from: UTS OPUS or Publisher's site
Missing data imputation is a real and challenging issue in machine learning and data mining, because missing values in a dataset can generate bias that affects the quality of the learned patterns or the classification performance. To deal with this issue, this paper proposes a Grey-Based K-NN Iteration Imputation method, called GBKII, for imputing missing values. GBKII is an instance-based imputation method, which can be regarded as a non-parametric regression method in statistics. It is also efficient for handling categorical attributes. We experimentally evaluate our approach and demonstrate that GBKII is much more efficient than the k-NN and mean-substitution methods.
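A plain KNN imputation baseline, which GBKII extends by using grey relational analysis as the similarity measure and by iterating until the imputed values stabilise, might look like this (illustrative sketch, not the paper's code):

```python
def knn_impute(rows, target, miss_col, k=3):
    """Fill the missing value rows[target][miss_col] with the mean of
    that column over the k complete rows nearest to the target on the
    observed (non-missing) columns."""
    obs = [c for c in range(len(rows[0])) if c != miss_col]

    def dist(r):                                   # Euclidean on observed columns
        return sum((r[c] - rows[target][c]) ** 2 for c in obs) ** 0.5

    donors = [r for i, r in enumerate(rows)
              if i != target and r[miss_col] is not None]
    neighbours = sorted(donors, key=dist)[:k]
    return sum(r[miss_col] for r in neighbours) / len(neighbours)
```

For categorical attributes, the mean would be replaced by a majority vote over the neighbours' values.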
Ou, Y., Cao, L., Yu, T. & Zhang, C. 2007, 'Detecting Turning Points of Trading Price and Return Volatility for Market', Workshop on Agents & Data Mining Interaction (ADMI 2007), International Workshop on Agents and Data Mining Interaction, IEEE Computer Soc, San Jose, pp. 491-494.
View/Download from: UTS OPUS or Publisher's site
The trading agent concept is very useful for trading strategy design and market mechanism design. In this paper, we introduce the use of trading agents for market surveillance. Market surveillance agents can be developed for market surveillance officers and management teams, presenting them with alerts and indicators of abnormal market movements. In particular, we investigate strategies for market surveillance agents to detect the impact of company announcements on market movements. This paper examines the performance of segmentation on the time series of trading price and of return volatility, respectively. The purpose of segmentation is to detect the turning points of market movements caused by announcements, which are useful for identifying indicators of insider trading. The experimental results indicate that segmentation on the time series of return volatility outperforms that on the time series of trading price: it is easier to detect the turning points of return volatility than those of trading price. The results will be used to build market surveillance agents that monitor abnormal market movements before the disclosure of market-sensitive announcements. In this way, market surveillance agents can assist market surveillance officers with indicators and alerts.
Zhao, Y., Zhang, H., Figueiredo, F., Cao, L. & Zhang, C. 2007, 'Mining for combined association rules on multiple datasets', Proceedings of the 2007 international workshop on Domain driven data mining, International Workshop on Domain Driven Data Mining, ACM, San Jose, USA, pp. 18-23.
View/Download from: UTS OPUS
Many organisations have their digital information stored in a distributed structure, be it in different locations or in vertically and horizontally partitioned repositories, which brings a high level of complexity to data mining. From a classical data mining view, where the algorithms expect a denormalised structure to operate on, heterogeneous data sources such as static demographic and dynamic transactional data must be manipulated and integrated to the extent that commercial association rule algorithms can be applied. Bearing in mind the usefulness and understandability of the application from a business perspective, combined rules of multiple patterns derived from different repositories, containing historical and point-in-time data, were used to produce new techniques in association mining applied to debt recovery. Initially, debt repayment patterns were discovered using transactional data and class labels defined by domain expertise; then demographic patterns were attached to each of the class labels. After combining the patterns, two types of rules were discovered, leading to different results: 1) the same demographic pattern with different repayment patterns, and 2) the same repayment pattern with different demographic patterns. The rules produced are interesting, valuable, complete and understandable, which shows the applicability and effectiveness of the new method.
Xu, Z., Zhang, C., Zhang, S., Song, W. & Yang, B. 2007, 'Efficient Attribute Reduction Based on Discernibility Matrix', Rough Sets and Knowledge Technology, Rough Sets and Knowledge Technology, Springer, Canada, pp. 13-21.
View/Download from: UTS OPUS
To reduce the time complexity of attribute reduction based on the discernibility matrix, a simplified decision table is first introduced, and an algorithm with time complexity O(|C||U|) is designed for calculating the simplified decision table. Then, a new measure of the significance of an attribute is defined for reducing the search space of the simplified decision table, and a recursive algorithm with time complexity O(|U/C|) is proposed for computing the attribute significance. Finally, an efficient attribute reduction algorithm is developed based on the attribute significance. This algorithm matches existing algorithms in performance, and its time complexity is O(|C||U|) + O(|C|²|U/C|).
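The O(|C||U|) simplified-decision-table step can be pictured as a single hashing pass that collapses objects with identical condition-attribute values. This is an illustrative reading of the idea, not the paper's actual algorithm; the representative-per-group choice is an assumption here.

```python
def simplify_decision_table(rows, n_cond):
    """Collapse objects whose first n_cond (condition) attribute values are
    identical into a single representative, in one O(|C||U|) hashing pass.
    Illustrative sketch only -- the paper's algorithm is not reproduced."""
    seen = {}
    for row in rows:
        key = tuple(row[:n_cond])   # condition part of the object
        seen.setdefault(key, row)   # keep the first representative
    return list(seen.values())
```

Because each object is hashed once over its |C| condition attributes, the pass visits every value exactly once, which is where the O(|C||U|) bound comes from.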
Cao, L., Luo, C. & Zhang, C. 2007, 'Agent-Mining Interaction: An Emerging Area', Autonomous Intelligent Systems: Multi-Agents and Data Mining, International Workshop Autonomous Intelligent Systems: Agents and Data Mining, Springer, St. Petersburg, Russia, pp. 60-73.
View/Download from: UTS OPUS or Publisher's site
In the past twenty years, agents (by which we mean autonomous agents and multi-agent systems) and data mining (also knowledge discovery) have emerged separately as two of the most prominent, dynamic and exciting research areas. In recent years, an increasingly remarkable trend in both areas is agent-mining interaction and integration. This is driven not only by researchers' interests, but by intrinsic challenges and requirements from both sides, as well as the benefits and complementarity that interaction brings to both communities. In this paper, we draw a high-level overview of agent-mining interaction from the perspective of an emerging area in the scientific family. To promote it as a newly emergent scientific field, we summarize its key driving forces, originality, major research directions and respective topics, and the progression of research groups, publications and activities. Both theoretical and application-oriented aspects are addressed. The investigation shows that agent-mining interaction is attracting ever-increasing attention from both the agent and data mining communities. Some complicated challenges in either community may be effectively and efficiently tackled through agent-mining interaction. However, as a new open area, many issues await research and development from theoretical, technological and practical perspectives. This work is sponsored by Australian Research Council Discovery Grants (DP0773412, LP0775041, DP0667060, DP0449535) and UTS internal grants.
Zhang, J., Zhang, S., Zhu, X., Wu, X. & Zhang, C. 2007, 'Measuring the Uncertainty of Differences for Contrasting Groups', Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, National Conference of the American Association for Artificial Intelligence, AAAI Press, Vancouver, Canada, pp. 1920-1921.
View/Download from: UTS OPUS
In this paper, we propose an empirical likelihood (EL) based strategy for building confidence intervals for differences between two contrasting groups. The proposed method can deal with situations in which little prior knowledge about the two groups is available, referred to as non-parametric situations. We experimentally evaluate our method on UCI datasets and observe that the proposed EL-based method outperforms other methods.
Zhang, S., Zhu, X., Zhang, J. & Zhang, C. 2007, 'Cost-Time Sensitive Decision Tree with Missing Values', Knowledge Science, Engineering and Management, International Conference on Knowledge Science, Engineering and Management, Springer, Melboure, pp. 447-459.
View/Download from: UTS OPUS or Publisher's site
Cost-sensitive decision tree learning is important and popular in the machine learning and data mining community, and much of the literature to date focuses on misclassification cost and test cost. In real-world applications, however, time sensitivity should also be considered in cost-sensitive learning. In this paper, we treat the time cost in cost-sensitive learning as a waiting cost (WC), and propose a novel splitting criterion for constructing a cost-time sensitive (CTS) decision tree that maximally decreases the intangible cost. A hybrid test strategy that combines the sequential and batch test strategies is then adopted in CTS learning. Finally, extensive experiments show that our algorithm outperforms the others with respect to decreasing misclassification cost.
Cao, L. & Zhang, C. 2007, 'F-Trade: An Agent-Mining Symbiont for Financial Services', Agent & Data Mining Interaction, International Conference on Autonomous Agents and Multiagent Systems, IFAAMAS, Honolulu, Hawai'i, pp. 1363-1364.
View/Download from: UTS OPUS
The interaction and integration of agent technology and data mining presents prominent benefits for solving some challenging issues in the individual areas. For instance, data mining can enhance agent learning, while agents can benefit data mining with distributed pattern discovery. In this paper, we summarize the main functionalities and features of an agent service and data mining symbiont, F-Trade. F-Trade is constructed with Java agent services following the theory of open complex agent systems. We demonstrate the roles of agents in building up F-Trade, how agents can support data mining and, on the other hand, how data mining is used to strengthen agents. F-Trade provides flexible and efficient services for back-testing, optimization and discovery of trading evidence, as well as plug-and-play of algorithms, data and system modules for financial trading and surveillance, with online connectivity to huge quantities of global market data.
Cao, L., Luo, C. & Zhang, C. 2007, 'Developing Actionable Trading Strategies for Trading Agents', Proceedings of the IEEE/WIC/ACM International Conference on Intellligent Agent Technology, IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IEEE Computer Soc, San Jose, pp. 72-75.
View/Download from: UTS OPUS or Publisher's site
Trading agents are very useful for developing and backtesting quality trading strategies before actions are taken in the real world. However, existing trading agent research mainly focuses on simulation using artificial data and market models. As a result, the actionable capability of the developed trading strategies is often limited. In this paper, we analyze the constraints on developing actionable trading strategies for trading agents, and deploy these points in developing a series of trading strategies through optimizing and enhancing actionable trading strategies. We demonstrate working case studies on large-scale market data, and evaluate these approaches and their performance from both technical and business perspectives.
Zhu, X., Zhang, S., Zhang, J. & Zhang, C. 2007, 'Cost-Sensitive Imputing Missing Values with Ordering', Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, National Conference of the American Association for Artificial Intelligence, AAAI Press, Vancouver, Canada, pp. 1922-1923.
View/Download from: UTS OPUS
Various approaches for dealing with missing data have been developed so far. In this paper, two strategies are proposed for cost-sensitive iterative imputation of missing values with optimal ordering. Experimental results demonstrate that the proposed strategies outperform existing methods in terms of imputation cost and accuracy.
Wang, J., Zhang, C., Wu, X., Qi, H. & Wang, J. 2003, 'SVM-OD: SVM method to detect outliers', Foundations And Novel Approaches In Data Mining, 2nd Workshop on Foundations and New Directions of Data Mining, Springer-Verlag Berlin, Melbourne, FL, pp. 129-141.
View/Download from: UTS OPUS
Outlier detection is an important task in data mining because outliers can be either useful knowledge or noise. Many statistical methods have been applied to detect outliers, but they usually assume a given distribution of data and it is difficult to dea
Chen, Q., Chen, P.Y., Zhang, S. & Zhang, C. 2006, 'Detecting collusion attacks in security protocols', Frontiers Of WWW Research And Development - Apweb 2006, Proceedings - Lecture Notes in Computer Science, Asia Pacific Web Conference, Springer-Verlag Berlin, Harbin, China, pp. 297-306.
View/Download from: UTS OPUS or Publisher's site
Security protocols have been widely used to safeguard secure electronic transactions. We usually assume that principals are credible and shall not maliciously disclose their individual secrets to someone else. Nevertheless, it is impractical to completel
Cao, L. & Zhang, C. 2006, 'Domain-driven actionable knowledge discovery in the real world', Advances In Knowledge Discovery And Data Mining, Proceedings, Lecture Notes in Artificial Intelligence, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer-Verlag Berlin, Singapore, pp. 821-830.
View/Download from: UTS OPUS or Publisher's site
Actionable knowledge discovery is one of Grand Challenges in KDD. To this end, many methodologies have been developed. However, they either view data mining as an autonomous data-driven trial-and-error process, or only analyze the issues in an isolated a
Zhang, C. & Cao, L. 2006, 'Domain-driven mining: Methodologies and applications', Advances in Intelligent IT: Active Media Technology 2006, International Conference on Active Media, IOS Press, Brisbane, Australia, pp. 13-16.
View/Download from: UTS OPUS
Chen, Q., Chen, P.Y., Zhang, S. & Zhang, C. 2006, 'Detecting Collusion in Security protocols', Frontiers of WWW Research and Development, Asia Pacific Web Conference, Springer, Harbin, China, pp. 297-306.
View/Download from: UTS OPUS
Chen, Q., Chen, P.Y., Zhang, C. & Li, L. 2006, 'Mining frequent itemsets for protein kinase regulation', PRICAI 2006: Trends in Artificial Intelligence, Pacific Rim International Conference on Artificial Intelligence, Springer, Guilin, China, pp. 222-230.
View/Download from: UTS OPUS or Publisher's site
Protein kinases, a family of enzymes, serve as an important signaling intermediary in living organisms, regulating critical biological processes such as memory, hormone response and cell growth. Unbalanced kinases are known to cause cancer and other diseases. The increasing effort to collect, store and disseminate information about the entire kinase family not only yields a valuable dataset for understanding cell regulation but also poses a big challenge in extracting knowledge about metabolic pathways from the data. Data mining techniques that have been widely used to find frequent patterns in large datasets can be extended and adapted to kinase data as well. This paper proposes a framework for mining frequent itemsets from a collected kinase dataset. An experiment using AMPK regulation data demonstrates that our approach is useful and efficient in analyzing kinase regulation data.
Zhang, C., Qin, Y., Zhu, X., Zhang, J. & Zhang, S. 2006, 'Clustering-based missing value imputation for data pre-processing', Industrial Informatics, 2006 IEEE International Conference, IEEE International Conference on Industrial Informatics, IEEE, Singapore, pp. 1081-1086.
View/Download from: UTS OPUS
Zhao, Y., Zhang, C. & Zhang, S. 2006, 'Efficient frequent itemsets mining by sampling', Advances in Intelligent IT: Active Media Technology 2006, International Conference on Active Media Technology, IOS Press, Brisbane, Australia, pp. 112-117.
View/Download from: UTS OPUS
Zhang, S., Qin, Y., Zhu, X., Zhang, J. & Zhang, C. 2006, 'Kernel-based multi-imputation for missing data', Advances in intelligent IT, International Conference on Active Media, IOS Press, Brisbane, Australia, pp. 106-111.
View/Download from: UTS OPUS
Cao, L., Luo, D. & Zhang, C. 2006, 'Fuzzy genetic algorithms for pairs mining', PRICAI 2006: Trends In Artificial Intelligence, Proceedings, Lecture Notes in Artificial Intelligence, Pacific Rim International Conference on Artificial Intelligence, Springer-Verlag Berlin, Guilin, China, pp. 711-720.
View/Download from: UTS OPUS or Publisher's site
Pairs mining targets to mine pairs relationship between entities such as between stocks and markets in financial data mining. It has emerged as a kind of promising data mining applications. Due to practical complexities in the real-world pairs mining suc
Ni, J. & Zhang, C. 2006, 'A dynamic storage method for stock transaction data', Proceedings of the 2nd International Conference on Computational Intelligence, IASTED International Conference on Computational Intelligence, ACTA Press, San Francisco, USA, pp. 338-342.
View/Download from: UTS OPUS
Ni, J. & Zhang, C. 2006, 'A human-friendly MAS for mining stock data', Proceedings of 2006 IEEE/ACM/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2006 Workshops), IEEE/WIC/ACM international Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society press, Hong Kong, China, pp. 19-22.
View/Download from: UTS OPUS or Publisher's site
Mining stock data can be beneficial to the participants and researchers in the stock market. However, it is very difficult for a normal trader or researcher to apply data mining techniques to the data on his own due to the complexity involved in the whole data mining process. In this paper, we present a multi-agent system that can help users easily deal with their data mining jobs on stock data. This system guides users to specify their mining tasks by simply specifying the data sets to be mined and selecting pre-defined and/or user-added data mining agents. This approach offers normal traders a practical and flexible solution to mining stock data.
Zhao, Y., Zhang, C., Zhang, S. & Zhao, L. 2006, 'Adapting K-Means Algorithm for Discovering Clusters in Subspaces', Frontiers of WWW Research and Development - APWb 2006, Asia Pacific Web Conference, Springer, Habin, China, pp. 53-62.
View/Download from: UTS OPUS
Ni, J., Cao, L. & Zhang, C. 2006, 'Agent services-oriented architectural design of a framework for artificial stock markets', Advances in Intelligent IT: Active Media Technology 2006, International Conference on Active Media Technology, IOS Press, Brisbane, Australia, pp. 396-399.
View/Download from: UTS OPUS
Zhao, Y., Zhang, C. & Zhang, S. 2006, 'Enhancing DWT for recent-biased dimension reduction of time', AI 2006: Advances in Artificial Intelligence, Australasian Joint Conference on Artificial Intelligence, Springer, Hobart, Australia, pp. 1048-1053.
View/Download from: UTS OPUS or Publisher's site
In many applications, old data in a time series become less important as time elapses, which is a big challenge for traditional dimension reduction techniques. To improve the Discrete Wavelet Transform (DWT) for effective dimension reduction in such applications, a new method, largest-latest-DWT, is designed by keeping the largest k coefficients out of the latest w coefficients at each level of the DWT. Its efficiency and effectiveness are demonstrated by our experiments.
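The keep-the-largest-k-of-the-latest-w rule can be sketched on an averaging-based Haar transform. This is a simplified illustration under stated assumptions: the power-of-two signal length, the unnormalised Haar variant, and zeroing (rather than discarding) the dropped coefficients are choices made here, not taken from the paper.

```python
def haar_dwt(signal):
    """Full averaging-based Haar DWT: returns the detail coefficients of
    each level (finest first) followed by the final approximation.
    Assumes a power-of-two length."""
    coeffs, cur = [], list(signal)
    while len(cur) > 1:
        avg = [(cur[i] + cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        det = [(cur[i] - cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        coeffs.append(det)
        cur = avg
    coeffs.append(cur)   # final approximation
    return coeffs

def largest_latest(coeffs, k, w):
    """At each detail level keep only the largest-magnitude k out of the
    latest w coefficients, zeroing the rest (recent-biased reduction)."""
    kept = []
    for level in coeffs[:-1]:
        latest = level[-w:]                  # most recent w coefficients
        keep = sorted(range(len(latest)), key=lambda i: abs(latest[i]),
                      reverse=True)[:k]
        reduced = [0.0] * len(level)
        for i in keep:
            reduced[len(level) - len(latest) + i] = latest[i]
        kept.append(reduced)
    kept.append(coeffs[-1])                  # approximation is always kept
    return kept
```

Older coefficients fall outside the latest-w window at every level, so the representation automatically de-emphasises the past, which is the recent-biased behaviour the abstract describes.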
Qin, Z., Zhang, S. & Zhang, C. 2006, 'Missing or absent? A question in Cost-sensitive Decision Tree', Advances in intelligent IT, IEEE ACtive Media Technology, IOS Press, Brisbane, Australia, pp. 118-126.
View/Download from: UTS OPUS
Zhao, Y., Cao, L., Morrow, Y.K., Ou, Y., Ni, J. & Zhang, C. 2006, 'Discovering debtor patterns of Centrelink customers', Data mining 2006; Proceedings of AusDM 2006, Australian Data Mining Conference, ACS Inc, Sydney, Australia, pp. 135-144.
View/Download from: UTS OPUS
Ni, J. & Zhang, C. 2006, 'Mining better technical trading strategies with genetic algorithms', Proceedings of the International Workshop on Integrating AI and Data Mining, International Workshop on Integrating AI and Data Mining, IEEE Computer Society, Hobart, Australia, pp. 26-33.
View/Download from: UTS OPUS or Publisher's site
Technical analysis is one of the two main schools of thought in the analysis of security prices. It is widely believed and applied by many professional and amateur traders. However, it is often criticized for lacking scientific rigour or worse, for lacking any basis whatsoever. We propose to explore the feasibility and/or limitation of technical analysis by the optimization of technical trading strategies over historical stock data with genetic algorithms. This paper presents the optimization problem in detail and discusses the potential problems to be tackled during the optimization. Preliminary experiments show that it can identify the limitations quickly.
Zhang, S., Qin, Y., Zhu, X., Zhang, J. & Zhang, C. 2006, 'Optimized parameters for missing data imputation', PRICAI 2006: 9th Pacific rim international conference on artificial intelligence, Pacific Rim International Conference on Artificial Intelligence, Springer, Guilin, China, pp. 1010-1016.
View/Download from: UTS OPUS
To complete missing values, a solution is to use attribute correlations within data. However, it is difficult to identify such relations within data containing missing values. Accordingly, we develop a kernel-based missing data imputation method in this
Zhang, S., Yu, J.X., Lu, J. & Zhang, C. 2006, 'Is frequency enough for decision makers to make decisions?', Advances in Knowledge Discovery and Data Mining, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Singapore, pp. 499-503.
View/Download from: UTS OPUS or Publisher's site
There are many advanced techniques that can efficiently mine frequent itemsets using a minimum support. However, the question that remains unanswered is whether the minimum support can really help decision makers to make decisions. In this paper, we stud
Wang, J. & Zhang, C. 2006, 'Dynamic Focus Strategies for Electronic Trade Execution in Limit Order', Proceedings Joint Conference 8th IEEE International Conference on E-Commerce and Technology (CEC 2006); 3rd IEEE International Conference on Enterprise Computing, E-Commerce and E-Services (EEE 2006), IEEE Conference on Electronic Commerce Technology, IEEE, San Francisco, California, USA, pp. 1-8.
View/Download from: UTS OPUS or Publisher's site
Trade execution has attracted considerable attention from academia and the financial industry due to its significant impact on investment return. Recently, limit order strategies for trade execution have been backtested on historical order/trade data, and dynamic price adjustment has been proposed to respond to state variables during execution. This paper emphasizes the effect of dynamic volume adjustment on limit order strategies and proposes dynamic focus (DF) strategies, which incorporate a series of market orders of different volumes into the limit order strategy and dynamically adjust their volume by monitoring state variables, such as inventory and order book imbalance, in real time. The sigmoid function is suggested as the quantitative model to represent the relationship between the state variables and the volume to be adjusted. Empirical results on historical order/trade data from the Australian Stock Exchange show that the DF strategy can outperform a limit order strategy that does not adopt dynamic volume adjustment.
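The sigmoid mapping from a state variable to order volume can be sketched as below. The steepness parameter, the rounding, and the choice of imbalance as the input are illustrative assumptions; the paper's actual parameterisation is not reproduced here.

```python
import math

def adjusted_volume(base_volume, state, steepness=4.0):
    """Sigmoid mapping from a state variable (e.g. order-book imbalance
    in [-1, 1]) to the volume of the next market-order slice.
    Illustrative sketch only, not the paper's fitted model."""
    factor = 1.0 / (1.0 + math.exp(-steepness * state))
    return int(round(base_volume * factor))
```

A neutral state maps to half the base volume, while strong positive (negative) imbalance pushes the slice toward the full (zero) volume, which is the qualitative behaviour a sigmoid gives.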
Zhang, C., Chen, F., Wu, X. & Zhang, S. 2006, 'Identifying bridging rules between conceptual clusters', International Conference on Knowledge Discovery and Data Mining, ACM International Conference on Knowledge Discovery and Data Mining, ACM Press, Philadelphia, USA, pp. 815-820.
View/Download from: UTS OPUS or Publisher's site
A bridging rule in this paper has its antecedent and action from different conceptual clusters. We first design two algorithms for mining bridging rules between clusters in a database, and then propose two non-linear metrics for measuring the interestingness of bridging rules. Bridging rules can be distinct from association rules (or frequent itemsets). This is because (1) bridging rules can be generated by infrequent itemsets that are pruned in association rule mining; and (2) bridging rules are measured by the importance that includes the distance between two conceptual clusters, whereas frequent itemsets are measured by only the support.
Lin, L., Cao, L. & Zhang, C. 2005, 'Genetic algorithms for robust optimization in financial applications', Proceedings Of The Iasted International Conference On Computational Intelligence, IASTED International Conference on Computational Intelligence, ACTA Press, Calgary, Canada, pp. 387-391.
View/Download from: UTS OPUS
In stock market or other financial market systems, the technical trading rules are used widely to generate buy and sell alert signals. In each rule, there are many parameters. The users often want to get the best signal series from the in-sample sets, (H
Zhang, C. & Zhang, S. 2005, 'In-depth data mining and its application in stock market', Lecture Notes In Artificial Intelligence, 1st International Conference on Advanced Data Mining and Applications, Springer, Wuhan, China, pp. 13-13.
Cao, L., Zhang, C. & Ni, J. 2005, 'Agent services-oriented architectural design of open complex agent systems', Proceedings of 2005 IEEE/WIC/ACM International Conference On Intelligent Agent Technology, IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IEEE, Compiegne, France, pp. 120-123.
View/Download from: UTS OPUS or Publisher's site
Architectural design is a critical phase in building agent-based systems. However, most existing agent-oriented software engineering approaches deliver weak or incomplete support for the architectural design of distributed, and especially Internet-based, agent systems. On the other hand, the emergence of service-oriented computing (SOC) brings intrinsic mechanisms that complement agent-based computing (ABC). In this paper, we investigate the dialogue between ABC and SOC and their integration in architectural design. We synthesize them to develop the computational concept of the agent service, and build a new design approach called agent service-oriented architectural design (ASOAD). ASOAD expands the content and range of agents and ABC, and combines the qualities of SOC, such as interoperability and openness, with the strengths of ABC, such as flexibility and autonomy. It is suitable for designing distributed agent systems and agent service-based enterprise application integration.
Ni, J. & Zhang, C. 2005, 'An efficient implementation of the backtesting of trading strategies', Parallel and Distributed Processing and Applications, IEEE International Symposium on Parallel and Distributed Processing with Applications, Springer, Nanjing, China, pp. 126-131.
View/Download from: UTS OPUS or Publisher's site
Some trading strategies are becoming more and more complicated and utilize a large amount of data, which makes the backtesting of these strategies very time consuming. This paper presents an efficient implementation of the backtesting of such a trading s
Cao, L., Schurmann, R. & Zhang, C. 2005, 'Domain-Driven In-Depth pattern Discovery: A Practical Methodology', Proceedings 4th Australasion Data Mining Conference AusDM05, Australian Data Mining Conference, The University of Technology, Sydney, Sydney, Australia, pp. 101-114.
View/Download from: UTS OPUS
Lin, L., Cao, L. & Zhang, C. 2005, 'The Fish-Eye Visualization of foreign Currency Exchange Data Streams', Asia-Pacific Symposium on Information Visualisation 2005, Asia-Pacific Symposium on Information Visualisation, ACS, Sydney, Australia, pp. 91-96.
View/Download from: UTS OPUS
In a foreign currency exchange market there are high-density data streams, and present approaches to visualizing this type of data cannot show both local details and the global trend in one figure. In this paper, based on the features and attributes of foreign currency exchange trading streams, we discuss and compare multiple approaches for the visual display of high-density foreign currency exchange transactions, including interactive zooming, multiform sampling combined with attributes of the large data streams, and fish-eye view embedded visualization. By comparison, fish-eye-based visualization is the best option: it can display regional records in detail without losing the global movement trend of the market in a limited display window. We used fish-eye technology for the output visualization of foreign currency exchange trading strategies in our trading support system, which links to real-time foreign currency market closing data.
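A fish-eye distortion of the kind compared above can be sketched with the textbook graphical fisheye function g(t) = (d+1)t / (dt+1), applied symmetrically around a focus point on a normalised axis. The formula and the distortion factor d are the standard form, assumed here; the system's actual implementation is not reproduced.

```python
def fisheye(x, focus, d=3.0):
    """Graphical fisheye on a normalised coordinate x in [0, 1]: points
    near the focus are spread out, the periphery is compressed, and the
    endpoints stay fixed. d is the distortion factor (d=0 is identity)."""
    def g(t):                        # basic fisheye on [0, 1]
        return (d + 1) * t / (d * t + 1)
    if x >= focus:
        span = 1.0 - focus
        if span == 0.0:
            return x
        return focus + g((x - focus) / span) * span
    return focus - g((focus - x) / focus) * focus
```

Mapping the (normalised) time axis of a dense stream through this function magnifies the region around the trader's point of interest while keeping the whole series visible in one window.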
Lin, L., Cao, L. & Zhang, C. 2005, 'The Visualization of Large Database in Stock Markets', Proceedings of the IASTED International Conference on Databases and Applications, IASTED International Multi Conference, ACTA Press, Innsbruck, Austria, pp. 163-166.
View/Download from: UTS OPUS
Chen, Q., Yi Ping, C., Zhang, C. & Zhang, S. 2005, 'A Framework for Merging Inconsistent Belief in Security Protocol Analysis', Proceedings of the 2005 International Workshop on Data engineering Issues in E-Commerce, International Workshop on Data engineering Issues in E-Commerce, IEEE, Tokyo, Japan, pp. 119-124.
View/Download from: UTS OPUS or Publisher's site
This paper proposes a framework for merging inconsistent beliefs in the analysis of security protocols. Merging is a procedure of computing the inferred beliefs of message sources and resolving the conflicts among the sources. Some security properties of secure messages are used to ensure the correctness of message authentication. Several instances are presented, demonstrating that our method is useful in resolving inconsistent beliefs in secure messages.
Ni, A., Zhu, X. & Zhang, C. 2005, 'Any-Cost Discovery: Learning Optimal classification Rules', AI 2005: Advances in Artificial Intelligence, 18th Australian Joint Conference on Artificial Intelligence Proceedings, Australasian Joint Conference on Artificial Intelligence, Springer, Sydney, Australia, pp. 123-132.
View/Download from: UTS OPUS or Publisher's site
Fully taking into account the hints possibly hidden in absent data, this paper proposes a new criterion for selecting splitting attributes when building a decision tree for a given dataset. In our approach, a certain cost must be paid to obtain an attribute value, and a cost is incurred when a prediction is erroneous; we use different scales for the two kinds of cost instead of the single cost scale assumed by previous work. We propose a new algorithm that builds a decision tree with a null-branch strategy to minimize the misclassification cost. When the consumer has finite resources, we can make the best use of those resources and obtain optimal results from the tree. We also consider discounts in test costs when groups of attributes are tested together, and we put forward advice on whether increasing resources is worthwhile. Our results can be readily applied to real-world diagnosis tasks, such as medical diagnosis, where doctors must determine which tests should be performed for a patient to minimize the misclassification cost within certain resources.
Cao, L., Zhang, C., Luo, D., Chen, W. & Zamari, N. 2004, 'Integrative Early Requirements Analysis for Agent-Based Software', Fourth International Conference on Hybrid Intelligent Systems HIS-2004, International Conference on Hybrid Intelligent Systems, IEEE Computer Society Press, Kitakyushu, Japan, pp. 1-6.
View/Download from: UTS OPUS or Publisher's site
Early requirements analysis (ERA) is quite significant for building agent-based systems. Goal-oriented requirements analysis is promising for the agent-oriented early requirements analysis. In general, either visual modeling or formal specifications is u
Luo, D., Luo, C. & Zhang, C. 2005, 'A framework for relational link discovery', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 1311-1314.
View/Download from: Publisher's site
Link discovery is an emerging research direction for extracting evidence and links from multiple data sources. This paper proposes a self-organizing framework for discovering links from multi-relational databases. It includes the main functional modules for developing adaptive data transformers and representation specifications, multi-relational feature construction, and self-organizing multi-relational correlation and link discovery algorithms.
Qin, Z., Zhang, S. & Zhang, C. 2004, 'Cost-Sensitive Decision Trees with Multiple Cost Scales', AI 2004: Advances in Artificial Intelligence, 17th Australian Joint Conference on Artificial Intelligence Cairns, Australia, December 2004 Proceedings, Australasian Joint Conference on Artificial Intelligence, Springer, Cairns, Australia, pp. 380-390.
View/Download from: UTS OPUS or Publisher's site
Minimizing misclassification errors has been the main focus of inductive learning techniques such as CART and C4.5. However, misclassification error is not the only cost in a classification problem, and researchers have recently begun to consider both test and misclassification costs. Previous work assumes that the test cost and the misclassification cost must be defined on the same cost scale, yet it can be difficult to define multiple costs on one scale. In this paper, we address the problem by building a cost-sensitive decision tree that involves two kinds of cost scales: it minimizes one kind of cost while controlling the other within a given budget. Our work will be useful for many diagnostic tasks that involve minimizing target costs and the resources consumed in obtaining missing information.
Lin, L. & Zhang, C. 2004, 'The application of Fuzzy Sets in finding the Best Stock-Rule Pairs', Proceedings if the 5th International Conference on RASC, International Conference on Recent Advances in Soft Computing, Nottingham Trent University, Nottingham, UK, pp. 472-476.
View/Download from: UTS OPUS
Lin, L., Cao, L., Wang, J. & Zhang, C. 2004, 'The Applications of genetic algorithms in stock market data mining optimisation', Data Mining V, Data Mining, Text Mining and Their Business Application, Conference on Data Mining, Text Mining and Their Business Application, Wessex Institute of Technology Press, Malaga, Spain, pp. 273-280.
View/Download from: UTS OPUS
Cao, L., Wang, J., Lin, L. & Zhang, C. 2004, 'Agent Services - Based infrastructure for online assessment of training strategies', Proceedings IEEE/WIC/ACM International Conference on Intelligent Agent Technology (IAT 2004), IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IEEE, Beijing, China, pp. 345-348.
View/Download from: UTS OPUS or Publisher's site
Traders and researchers in stock markets often hold private trading strategies. Evaluating and optimizing these strategies before taking any risk in real trading is of great benefit to them. We have built an agent services-driven infrastructure, F-TRADE, which supports online plug-in, iterative back-testing, and recommendation of trading strategies. We propose an agent services-driven approach for building this automated enterprise infrastructure. The description, directory and mediation of agent services are discussed, as is the system structure of the agent services-based F-TRADE. F-TRADE has become an online test platform for research on and application of multi-agent technology and data mining in stock markets.
Cao, L., Luo, C., Luo, D. & Zhang, C. 2004, 'Hybrid Strategy of Analysis and Control of Telecommunications Frauds', Proceedings of 2nd International Conference on Information Technology and Applications, International Conference on Information Technology and Applications, IEEE, Harbin, China, pp. 11-15.
View/Download from: UTS OPUS
Cao, L., Ni, J., Wang, J. & Zhang, C. 2004, 'Agent Services-Driven Plug-and-Play in F-TRADE', AI 2004: Advances in Artificial Intelligence, 17th Australian Joint Conference on Artificial Intelligence Cairns, Australia, December 2004 Proceedings, Australasian Joint Conference on Artificial Intelligence, Springer, Cairns, Australia, pp. 917-922.
View/Download from: UTS OPUS or Publisher's site
We have built an agent service-based enterprise infrastructure, F-TRADE. With its online connectivity to large volumes of real stock data in global markets, it can be used for the online evaluation of trading strategies and data mining algorithms. The main functions of F-TRADE include soft plug-and-play, and the back-testing, optimization, integration and evaluation of algorithms. In this paper, we focus on introducing intelligent plug-and-play, a key system function of F-TRADE. The basic idea of soft plug-and-play is to build agent services that support the online plug-in of agents, algorithms and data sources. Agent UML-based modeling, the role model and agent services for plug-and-play are discussed. With this design, algorithm providers, data source providers, and system module developers of F-TRADE can expand system functions and resources by plugging them into F-TRADE online.
Cao, L., Luo, C., Luo, D. & Zhang, C. 2004, 'Integration of Business Intelligence Based on Three-Level Ontology Services', IEEE/WIC/ACM International Conference on Web Intelligence (WI2004), IEEE/WIC/ACM international Conference on Web Intelligence and Intelligent Agent Technology, IEEE, Beijing, China, pp. 17-23.
View/Download from: UTS OPUS or Publisher's site
Usually, business intelligence (BI) in a realistic telecom enterprise is integrated by packaging data warehouse (DW), OLAP, data mining and reporting tools from different vendors together. As a result, BI system users are handed a reporting system whose reports, data models, dimensions and measures are predefined by system designers, and according to one survey, 85% of DW projects failed to meet their intended objectives. In this paper, we investigate how to integrate BI packages into an adaptive and flexible knowledge portal by constructing an internal link and communication channel from top-level business concepts to the underlying enterprise information systems (EIS). An approach based on three-level ontology services is developed, which implements the unified naming, directory and transport of ontology services, as well as ontology mapping and query parsing among the conceptual, analytical and physical views, from the user interfaces through the DW to the EIS. Experiments on top of a real telecom EIS show that our solution for integrating BI supports operational decision making in a much more user-friendly and adaptive way than simply bundling the BI products presently available.
Li, C., Zhang, Z. & Zhang, C. 2004, 'A Platform for Dynamic Organisation of Agents in Agent-Based Systems', Proceedings of the IEEE/WIC/ACM International Conference on Intelligent Technology, IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IEEE, Beijing, China, pp. 454-457.
View/Download from: UTS OPUS or Publisher's site
In most agent-based systems, different middle agents are employed to increase flexibility. However, three issues remain unsolved: in a centralized architecture with a single middle agent, the middle agent itself is a bottleneck and suffers from single-point failure; middle agents in distributed architectures lack the capability of dynamically organizing agents; and reliability is weak because of single-point failures and the lack of an effective architecture. In this paper, we introduce a platform with a ring architectural model to solve all of the above problems. In the platform, multiple middle agents are dynamically supported, solving the first problem. To solve the second problem, middle agents dynamically manage the registration and cancellation of service provider agents and application teams, each of which includes a set of closely interacting requester agents that complete an independent task. A redundant middle-agent technique is proposed to solve the third problem. All middle agents are capable of proliferation and self-cancellation according to sensory input from their environment. To organize the middle agents effectively, a ring architectural model is proposed. We demonstrate the applicability of the platform through its application, and present experimental evidence that the platform is flexible and robust.
Li, C., Song, Q. & Zhang, C. 2004, 'MA-IDS Architecture for Distributed Intrusion Detection using Mobile Agent', Proceedings of 2nd International Conference on Information Technology and Applications, International Conference on Information Technology and Applications, IEEE, Harbin, China, pp. 451-455.
View/Download from: UTS OPUS
Li, C., Cheng, D. & Zhang, C. 2004, 'A Platform to Integrate well-log Information Application on heterogeneous Environments', Proceedings of 2nd International Conference on Information Technology and Applications, International Conference on Information Technology and Applications, IEEE, Harbin, China, pp. 265-270.
View/Download from: UTS OPUS
Lin, L., Jiang, Y., Fan, Z. & Zhang, C. 2004, 'Dynamic Value-Based Diagnosis System for Assembler Program', 15th International Workshop on Principles of Diagnosis (DX-04), International Workshop on Principles of Diagnosis, LAAS-CNRS, Carcassonne, France, pp. 209-213.
View/Download from: UTS OPUS
Zhao, Y., Zhang, C. & Shen, Y. 2004, 'Clustering High-Dimensional Data with Low-Order Neighbours', Proceedings IEEE/WIC/ACM International Conference on Web Intelligence (WI 2004), IEEE/WIC/ACM international Conference on Web Intelligence and Intelligent Agent Technology, IEEE, Beijing, China, pp. 103-109.
View/Download from: UTS OPUS or Publisher's site
Density-based and grid-based clustering are two main clustering approaches. The former is famous for its capability of discovering clusters of various shapes and eliminating noise, while the latter is well known for its high speed. Combining the two approaches seems to provide better clustering results. To the best of our knowledge, however, all existing algorithms that combine density-based and grid-based clustering take cells as atomic units, in the sense that either all objects in a cell belong to a cluster or no object in the cell belongs to any cluster. This requires the cells to be small enough to ensure a fine resolution of results. In high-dimensional spaces, however, the number of cells can be very large when cells are small, which makes the clustering process extremely costly. On the other hand, the number of neighbours of a cell grows exponentially with the dimensionality of the dataset, which increases the complexity further. In this paper, we present a new approach that takes objects (or points) as the atomic units, so that the restriction on cell size can be relaxed without degrading the resolution of the clustering results. In addition, a concept of ith-order neighbours is introduced to avoid considering the exponential number of neighbouring cells. By considering only low-order neighbours, our algorithm is very efficient while losing only a little accuracy. Experiments on synthetic and public data show that our algorithm can cluster high-dimensional data effectively and efficiently.
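The cell/neighbour bookkeeping described above can be illustrated with a small sketch (this is not the paper's algorithm; the function names and cell encoding are invented): points hash to grid cells, and only "low-order" neighbours, i.e. cells differing in a bounded number of coordinates, are enumerated instead of all 3^d - 1 surrounding cells.

```python
# Illustrative sketch of grid cells and low-order neighbours in high dimensions.
from itertools import combinations, product

def cell_of(point, width):
    """Map a point to the index of its grid cell (cells have side `width`)."""
    return tuple(int(x // width) for x in point)

def low_order_neighbours(cell, order):
    """Cells whose index differs from `cell` by +/-1 in exactly `order` dims."""
    d = len(cell)
    out = set()
    for dims in combinations(range(d), order):
        for deltas in product((-1, 1), repeat=order):
            n = list(cell)
            for i, dx in zip(dims, deltas):
                n[i] += dx
            out.add(tuple(n))
    return out

# In 10 dimensions a cell has 3**10 - 1 = 59048 surrounding cells in total,
# but only 2 * 10 = 20 first-order neighbours.
c = cell_of([0.3] * 10, width=1.0)   # -> the all-zero cell index
print(len(low_order_neighbours(c, order=1)))  # 20
```

This is why restricting attention to low-order neighbours keeps the search tractable as dimensionality grows, at the price of the small accuracy loss the abstract mentions.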
Zhao, Y., Zhang, C. & Zhang, S. 2004, 'Discovering interesting association rules by clustering', 17th Australian Joint Conference on Artificial Intelligence Proceedings, Australasian Joint Conference on Artificial Intelligence, Springer, Cairns, Australia, pp. 1055-1061.
View/Download from: UTS OPUS or Publisher's site
A great many metrics are available for measuring the interestingness of rules. In this paper, we design a distinct approach to identifying association rules that maximize interestingness in an applied context. More specifically, the interestingness of association rules is defined as the dissimilarity between the corresponding clusters. In addition, the interestingness measure assists in filtering out rules that may be uninteresting in applications. Experiments show the effectiveness of our algorithm.
Wang, J. & Zhang, C. 2004, 'Support Vector Machines based on set covering', Proceedings of 2nd International Conference on Information Technology and Applications, International Conference on Information Technology and Applications, IEEE, Harbin, China, pp. 181-184.
View/Download from: UTS OPUS
Zhang, G., Bai, C., Lu, J. & Zhang, C. 2004, 'Bayesian Network based Cost Benefit Factor Inference in Eservices', Proceedings of 2nd International Conference on Information Technology and Applications, International Conference on Information Technology and Applications, Macquarie Scientific Publishing, Harbin, China, pp. 464-469.
View/Download from: UTS OPUS
Wang, J. & Zhang, C. 2004, 'KBSVM: KMeans-based SVM for business intelligence', Proceedings of the 10th AMCIS, Americas Conference on Information Systems, Association for Information Systems, New York, USA, pp. 1889-1893.
View/Download from: UTS OPUS
Chen, Q., Zhang, C. & Zhang, S. 2004, 'A verification model for electronic transaction protocols', Advanced Web Technologies and Applications - APWEB'04, Asia Pacific Web Conference, Springer-Verlag Berlin, Hangzhou, China, pp. 824-833.
View/Download from: UTS OPUS or Publisher's site
Electronic transaction protocols have been found to contain subtle flaws. Recently, model checking has been used to verify electronic transaction protocols, because traditional approaches suffer from low efficiency and are error-prone. This paper proposes an extendable verification model specifically for validating electronic transaction protocols. In particular, the verification model is able to deal with inconsistency in transmitted messages. Thus, we can measure the incoherence in secure messages coming from different sources and at different moments, and ensure the validity of the verification result. We analyze two instances using this model. The analyses uncover some subtle flaws in the protocols.
Li, C., Song, Q. & Zhang, C. 2004, 'MA-IDS architecture for distributed intrusion detection using mobile agents', Proceedings of the Second International Conference on Information Technology and Applications (ICITA 2004), pp. 162-166.
Distributed intrusion detection systems (IDS) have many advantages such as scalability, subversion resistance, and graceful service degradation. However, there are some impediments to implementing them. Mobile agent (MA) technology has many features that suit the implementation of distributed IDS. In this paper, we propose MA-IDS, a novel architecture for distributed IDS based on MA technology. MA-IDS employs MA technology to process information from each monitored host in a coordinated manner, and then completes the global extraction of information about intruder actions. A prototype of a mobile agent-based distributed intrusion detection system following MA-IDS has been developed. The system also introduces an uncertainty factor into intrusion decisions, reflecting the reality that human behavior is changeable. We demonstrate the advantages and the potential of MA-IDS through evaluation results.
Zhang, G., Bai, C., Lu, J. & Zhang, C. 2004, 'Bayesian network based cost benefit factor inference in e-services', Proceedings of the Second International Conference on Information Technology and Applications (ICITA 2004), pp. 404-409.
This paper applies Bayesian network techniques to model and infer the uncertain relationships among cost factors and benefit factors in e-services. A cost-benefit factor-relation model proposed in our previous study is used as domain knowledge, and data collected through a survey serves as evidence for inference. By calculating conditional probability distributions among factors and conducting inference, this paper identifies that certain cost factors are significantly more important than others to certain benefit factors. In particular, this study found that 'increased investment in maintaining e-services' would significantly contribute to 'enhancing perceived company image' and 'gaining competitive advantages', and 'increased investment in staff training' would significantly contribute to 'realizing business strategies'. These results have the potential to improve the strategic planning of companies by identifying more effective investment areas and more suitable development activities where e-services are concerned.
Cao, L., Luo, C., Luo, D. & Zhang, C. 2004, 'Hybrid strategy of analysis and control of telecommunications frauds', Proceedings of the Second International Conference on Information Technology and Applications (ICITA 2004), pp. 281-285.
The problem of telecommunications fraud has been growing more serious for many years, not only in western countries but also in some developing countries. Detection, analysis and prevention mechanisms are emerging from both telecommunications operators and academia. In this paper, we present a hybrid strategy for the analysis and control of telecommunications fraud from an engineering viewpoint. Our first task is to identify the complexity of telecommunications fraud; we discuss possible fraud scenarios and their evolution. Furthermore, in order to build an information system that deals with realistic telecommunications fraud, we summarize and propose a hybrid strategy, which includes a solution package, five models and four types of analyses, to construct a closed-loop system for the analysis and control of fraud. We further discuss a system framework for the analysis and control of telecommunications fraud.
Lin, L., Ling, H., Lu, J., Zhang, C., Song, L. & Xue, H. 2003, 'Case-based Reasoning Integrating with Direct-Case-Linkage for Tacit Knowledge Management', Proceedings of the Seventh Pacific Asia Conference on Information Systems, Pacific Asia Conference on Information Systems, University of South Australia, Adelaide, Australia, pp. 1724-1733.
View/Download from: UTS OPUS
Li, C., Zhang, C. & Cao, L. 2003, 'Theoretical Evaluation of Ring-Based Architectural Model for Middle Agents in Agent-Based System', Foundations of Intelligent Systems. 14th Symposium, ISMIS 2003 Proceedings, International Symposium on Foundations of Intelligent Systems, Springer-Verlag Berlin Heidelberg, Maebashi City, Japan, pp. 603-607.
View/Download from: UTS OPUS or Publisher's site
The ring-based architectural model is usually employed to promote the scalability and robustness of agent-based systems. However, there are no criteria for evaluating its performance. In this paper, we introduce an evaluation approach for comparing the performance of the ring-based architectural model with that of other models. To evaluate the model, we propose an application-based information-gathering system with middle agents, which are organized in a ring-based architecture and solve the matching problem between service provider agents and requester agents. We evaluate the ring-based architectural model in terms of performance predictability, adaptability, and availability, and demonstrate its potential through the evaluation results.
Wang, J., Zhang, C., Wu, X., Qi, H. & Wang, J. 2003, 'SVM-OD: a New SVM Algorithm for Outlier Detection', Foundations and New Directions in Data Mining Workshop Notes, IEEE International Conference on Data Mining, IEEE, Melbourne, Australia, pp. 203-209.
Zhang, Z. & Zhang, C. 2003, 'Building Agent-Based Hybrid Intelligent Systems', Design and Application of Hybrid Intelligent Systems, HIS03 the Third International Conference on Hybrid Intelligent Systems, International Conference on Hybrid Intelligent Systems, IOS Press, Melbourne, Australia, pp. 799-808.
View/Download from: UTS OPUS
Many complex problems, including financial investment planning, foreign exchange trading, and knowledge discovery from large or multiple databases, require hybrid intelligent systems that integrate many intelligent techniques, including expert systems, fuzzy logic, neural networks, and genetic algorithms. However, hybrid intelligent systems are difficult to develop because they have a large number of components with many interactions. On the other hand, agents offer a new and often more appropriate route to the development of complex systems, especially in open and dynamic environments. In this paper, we argue through a successful case study that agent technology is well suited to constructing hybrid intelligent systems, especially loosely coupled ones. A great number of heterogeneous computing techniques and packages were easily integrated into the experimental system under a unifying agent framework, which implies that agent technology can greatly facilitate the construction of hybrid intelligent systems.
Cheng, X., Ouyang, D. & Zhang, C. 2003, 'A Logic Framework with Algebraic Extension', Proceedings of the 25th International Conference on Information Technology Interfaces, International Conference on Information Technology Interfaces, SRCE University Computing Centre,, Cavtat, Croatia, pp. 633-638.
View/Download from: UTS OPUS
We propose a many-sorted general framework that incorporates algebraic computation with logical reasoning, and that encompasses the following systems as special cases: lattice-valued fuzzy logic, operator fuzzy logic, operator fuzzy logic for belief, operator fuzzy logic for argumentation, fuzzy logic, probabilistic logic, annotated logic, the language of signed formulas, and autoepistemic logic.
Yan, X., Zhang, C. & Zhang, S. 2003, 'A Database-Independent Strategy for Confidence Determination', Proceedings of the Fifteenth International Conference on Software Engineering and Knowledge Engineering, International Conference on Software Engineering and Knowledge Engineering, Knowledge Systems Institute, San Francisco, California, USA, pp. 621-625.
View/Download from: UTS OPUS
Yan, X., Zhang, C. & Zhang, S. 2003, 'A Database-Independent Approach of Mining Association Rules with Genetic Algorithm', Intelligent Data Engineering and Automated Learning. 4th International Conference, IDEAL 2003, International Conference on Intelligent Data Engineering and Automated Learning, Springer-Verlag Berlin Heidelberg, Hong Kong, China, pp. 882-886.
View/Download from: UTS OPUS or Publisher's site
Apriori-like algorithms for association rule mining rely upon a minimum support and a minimum confidence, and users often find it hard to specify these thresholds. On the other hand, genetic algorithms are effective for global search, especially when the search space is so large that deterministic search methods are hardly feasible. We apply a genetic algorithm to association rule mining and propose an evolutionary method. Computational results show that our ARMGA model can be used to automate association rule mining systems, and that the ideas given in this paper are effective.
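A minimal genetic-algorithm sketch in the spirit described above (the encoding, operators, data and fitness here are illustrative inventions, not the ARMGA design): a chromosome encodes a rule X -> Y over item sets, and fitness rewards rules with high confidence, so no support/confidence thresholds need to be supplied by the user.

```python
# Toy GA over association rules: elitist selection plus a single mutation
# operator that moves one item between antecedent, consequent, and 'unused'.
import random

TRANSACTIONS = [{0, 1, 2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}]
ITEMS = [0, 1, 2]

def support(itemset):
    return sum(itemset <= t for t in TRANSACTIONS) / len(TRANSACTIONS)

def fitness(rule):
    antecedent, consequent = rule
    s_a = support(antecedent)
    if not antecedent or not consequent or s_a == 0:
        return 0.0                       # invalid rules die out
    return support(antecedent | consequent) / s_a   # confidence of X -> Y

def mutate(rule):
    a, c = set(rule[0]), set(rule[1])
    item = random.choice(ITEMS)
    a.discard(item); c.discard(item)
    side = random.choice(("antecedent", "consequent", "none"))
    if side == "antecedent":
        a.add(item)
    elif side == "consequent":
        c.add(item)
    return (a, c - a)                    # keep the two sides disjoint

random.seed(0)
pop = [({0}, {1}), ({2}, {0}), ({1}, {2})]
for _ in range(30):
    pop.sort(key=fitness, reverse=True)
    pop = pop[:2] + [mutate(random.choice(pop[:2]))]   # elitism + mutation
best = max(pop, key=fitness)
print(round(fitness(best), 2))  # 0.75
```

Elitism guarantees the best confidence found so far is never lost; a real system would add crossover, a larger population, and a fitness that also penalizes trivial rules.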
Cheng, X., Ouyang, D. & Zhang, C. 2003, 'A General Model-based Diagnosis', Proceedings of the 25th International Conference on Information Technology Interfaces, International Conference on Information Technology Interfaces, SRCE University Computing Centre, University of Zagreb, Cavtat, Croatia, pp. 627-632.
View/Download from: UTS OPUS
A general method for model-based diagnosis is developed that can handle multiple fault modes, and that enables users to analyze the completeness of the system model and to choose the observation subset appropriately, in order to obtain a small diagnostic space containing the right solutions. The existing consistency-based diagnosis and abductive diagnosis are special cases of this method. The relationship between the diagnostic procedure and the corresponding prime implicants is analyzed for implementation.
Chen, Q., Zhang, C. & Zhang, S. 2003, 'Verifying the Payment Authorization in SET Protocol', Intelligent Data Engineering and Automated Learning. 4th International Conference, IDEAL 2003, International Conference on Intelligent Data Engineering and Automated Learning, Springer-Verlag Berlin Heidelberg, Hong Kong, pp. 914-918.
View/Download from: UTS OPUS or Publisher's site
The Secure Electronic Transaction (SET) protocol is designed to conduct safe business over the Internet. We present a formal verification of the Payment Authorization phase of SET using ENDL (an extension of non-monotonic logic) [1]. The analysis uncovers some subtle defects that may invite malicious attacks. To overcome these vulnerabilities, some feasible countermeasures are proposed accordingly.
Yan, X., Zhang, C. & Zhang, S. 2003, 'Identifying Frequent Terms in Text Databases by Association Semantics', International Conference on Information Technology: Coding and Computing (ITCC 2003), International Conference on Information Technology: Coding and Computing, IEEE, Las Vegas, Nevada, USA, pp. 672-675.
View/Download from: UTS OPUS or Publisher's site
Existing information retrieval methods are mainly based on either term similarity or latent semantics. To reduce the amount of irrelevant information retrieved, this paper presents a new approach to information retrieval that applies the methodology of association rule mining to a text database. Association semantics among the terms of a document and a query are considered, so that the semantic similarity between the document and the query may be reduced if they are somewhat irrelevant.
Li, C., Zhang, C., Wang, M. & Song, Q. 2003, 'An Approach to Digitizing and Managing Well-Logging Parameter Graphs and Agent-Based Perspective', 2003 IEEE/WIC International Conference on Intelligent Agent Technology, IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IEEE Computer Society, Halifax, Canada, pp. 11-17.
View/Download from: UTS OPUS or Publisher's site
The curves on well-logging parameter graphs are very important for reservoir description, but how to preserve the parameter graphs permanently and use the data implied in those curves efficiently remain open problems in the petroleum industry. In this paper, we contribute an approach to digitizing well log curves and storing the digitized information in an Oracle database or a data file server from a multi-agent perspective. We employed the Gaia methodology and the Open Agent Architecture to analyze and design the system. According to the characteristics of well-logging parameter graphs, we implement the SCTR (Scanning, Compressing, Tracing, and Rectifying) algorithms with four agents to digitize well log curves. Two data management agents are developed for operating the database and data files with a uniform agent communication language. The experimental results show that this approach is effective.
Li, C., Song, Q., Wang, M. & Zhang, C. 2003, 'SCTR: An Approach to Digitizing Well-Logging Graph', Proceedings of the Sixth IASTED International Conference on Computer Graphics and Imaging, Sixth IASTED International Conference on Computer Graphics and Imaging, ACTA Press, Honolulu, Hawaii, USA, pp. 285-288.
Li, C., Zhang, C. & Wang, M. 2003, 'An Agent-based Framework for Well-Logging Information Management', Proceedings of the Second International Conference on Active Media Technology, International Conference on Active Media Technology, World Scientific Publishing Co. Pte. Ltd., Chongqing, China, pp. 114-119.
View/Download from: UTS OPUS
Li, C., Zhang, C. & Wang, M. 2003, 'An Agent-Based Curve-Digitizing System for Well-Logging Data Management', Proceedings ITCC 2003 International Conference on Information Technology: Coding and Computing, International Conference on Information Technology: Coding and Computing, IEEE Computer Society, Las Vegas, Nevada, USA, pp. 656-660.
Li, C., Zhang, C., Chen, Q. & Zhang, Z. 2003, 'A Scalable and Robust Framework for Agent-Based Heterogeneous Database Operation', Proceedings 2003 International Conference on Intelligent Agents, Web Technologies and Internet Commerce - IAWTIC 2003, International Conference on Intelligent Agents, Web Technologies, and Internet Commerce, University of Canberra, Vienna, Austria, pp. 260-271.
View/Download from: UTS OPUS
Yan, X., Zhang, C., Zhang, S. & Qin, Z. 2003, 'Indexing by Conditional Association Semantics', Information Technology and Organizations: Trends, Issues, Challenges and Solutions Volume 1, International Conference on Information Resources Management, Idea Group Publishing, Philadelphia, Pennsylvania, USA, pp. 691-693.
View/Download from: UTS OPUS or Publisher's site
Prevailing information retrieval methods are based on either term similarity or latent semantics, and terms are considered independently. This paper presents a new strategy for information retrieval: indexing by conditional association semantics. In our approach, the conditional association semantics of terms is considered during semantic indexing.
Chen, Q., Zhang, C., Zhang, S. & Li, C. 2003, 'Verifying the Purchase Request in SET Protocol', Web Technologies and Applications. 5th Asia-Pacific Web Conference, APWeb2003 Proceedings (Lecture Notes in Computer Science Vol 2642), Asia Pacific Web Conference, Springer-Verlag Berlin Heidelberg, Xi'an, China, pp. 263-274.
View/Download from: UTS OPUS or Publisher's site
The Secure Electronic Transaction (SET) protocol has been jointly developed by Visa and MasterCard to achieve secure online transactions. This paper presents a formal verification of the Purchase Request phase of SET using ENDL (an extension of non-monotonic logic). The analysis unveils some potential flaws, and some feasible countermeasures to these vulnerabilities are proposed during the validation. The modelling of Purchase Request is also described, to implement mechanical model checking instead of manual verification.
Cao, L., Luo, D., Luo, C. & Zhang, C. 2003, 'Systematic Engineering in Designing Architecture of Telecommunications Business Intelligence System', Design and Application of Hybrid Intelligent Systems, HIS03, the Third International Conference on Hybrid Intelligent Systems, International Conference on Hybrid Intelligent Systems, IOS Press, Melbourne, Australia, pp. 1084-1093.
View/Download from: UTS OPUS
Cao, L., Luo, C., Li, C., Zhang, C. & Dai, R.W. 2003, 'Open Giant Intelligent Information Systems and Its Agent-Oriented Abstraction Mechanism', Proceedings of the Fifteenth International Conference on Software Engineering and Knowledge Engineering, International Conference on Software Engineering and Knowledge Engineering, Knowledge Systems Institute, San Francisco, California, USA, pp. 85-89.
View/Download from: UTS OPUS
Cao, L., Li, C., Zhang, C. & Dai, R.W. 2003, 'Open Giant Intelligent Information Systems and Its Multiagent-Oriented System Design', Proceedings of the International Conference on Software Engineering Research and Practice Volume II, International Conference on Software Engineering Research and Practice, CSREA Press, Las Vegas, Nevada, USA, pp. 816-822.
View/Download from: UTS OPUS
Li, C., Zhang, C. & Zhang, Z. 2003, 'A Ring-Based Architectural Model for Middle Agents in Agent-Based System', Intelligent Data Engineering and Automated Learning, 4th International Conference, IDEAL 2003, Revised Papers, International Conference on Intelligent Data Engineering and Automated Learning, Springer-Verlag Berlin Heidelberg, Hong Kong, pp. 94-98.
View/Download from: UTS OPUS or Publisher's site
In agent-based systems, the performance of middle agents relies not only on the matchmaking algorithms they employ, but also on the architecture that organizes them with a suitable organizational structure and coordination mechanism. In this paper, we contribute a framework and develop a number of middle agents with a logical ring organizational structure to match requester agents with service provider agents. Each middle agent is capable of proliferation and self-cancellation according to sensory input from its environment, and a token-based coordination mechanism for the middle agents is designed. Two kinds of middle agents, namely host and duplicate, are designed to promote the scalability and robustness of agent-based systems. We demonstrate the potential of the architecture through a case study.
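The ring organization sketched in the abstract above can be illustrated with a toy model (illustrative only; the class, thresholds and agent names are invented, not the paper's design): middle agents sit in a circular list, a token circulates to serialize coordination, and agents proliferate or self-cancel based on load.

```python
# Toy ring of middle agents with token passing, proliferation under high load,
# and self-cancellation under low load.

class Ring:
    def __init__(self, names):
        self.agents = list(names)      # circular order of middle agents
        self.token = 0                 # index of the current token holder

    def pass_token(self):
        """Hand the coordination token to the next agent on the ring."""
        self.token = (self.token + 1) % len(self.agents)
        return self.agents[self.token]

    def proliferate(self, load, high=0.8):
        """Overloaded token holder spawns a duplicate next to itself."""
        if load > high:
            self.agents.insert(self.token + 1, self.agents[self.token] + "'")

    def self_cancel(self, load, low=0.1):
        """An idle agent removes itself; the ring closes around the gap."""
        if load < low and len(self.agents) > 1:
            removed = self.agents.pop(self.token)
            self.token %= len(self.agents)
            return removed

ring = Ring(["m1", "m2", "m3"])
ring.proliferate(load=0.9)      # m1 spawns duplicate m1'
print(ring.agents)              # ['m1', "m1'", 'm2', 'm3']
print(ring.pass_token())        # token moves to m1'
```

Because the ring has no single coordinator, losing one middle agent (self-cancellation or failure) only requires its neighbours to reconnect, which is the scalability/robustness argument the paper evaluates.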
Chen, Q. & Zhang, C. 2002, 'The Verification Logic for Secure Transaction Protocols', Proceedings International Conference on Intelligent Information Technology, ICIIT 2002, Peoples Posts and Telecommunications Publishing House, China, pp. 326-333.
View/Download from: UTS OPUS
Li, C., Zhang, C. & Zhang, Z. 2002, 'An Agent Based Middleware for Uniform Operation in a Heterogeneous Database Environment', Proceedings of the 4th Asia-pacific Conference on Simulated Evolution and Learning, Asia-Pacific Conference on Simulated Evolution and Learning, Nanyang Technological University, Singapore, pp. 385-389.
View/Download from: UTS OPUS
Niu, L., Yan, X., Zhang, C. & Zhang, S. 2002, 'Product Hierarchy-based Customer Profiles for Electronic Commerce Recommendation', Proceedings of 2002 International Conference on Machine Learning and Cybernetics, International Conference on Machine Learning and Cybernetics, IEEE Publication, Beijing, China, pp. 1075-1080.
View/Download from: UTS OPUS
Zhang, C., Yan, X., Zhang, S. & Kennedy, P.J. 2002, 'Mining Very Large Databases Using Software Agents', Proceedings of the International Conference on Machine Learning and Application (ICMLA 02), International Conference on Machine Learning and Application (ICMLA 02), CSREA Press, Las Vegas, USA, pp. 84-90.
View/Download from: UTS OPUS
Zhang, C. & Zhang, S. 2002, 'Database Clustering for Mining Multi-Databases', Proceedings of the 11th IEEE International Conference on Fuzzy Systems, IEEE International Conference on Fuzzy Systems, IEEE Press, Hawaii, USA, pp. 974-980.
View/Download from: UTS OPUS
Zhang, Z. & Zhang, C. 2002, 'An Improved Matchmaking Algorithm For Middle Agents', Proceedings of the 1st International Joint Conference on Autonomous Agents and Multi-Agent Systems, 1st International Joint Conference on Autonomous Agents and Multi-Agent Systems, ACM Press, Bologna, Italy, pp. 1340-1347.
View/Download from: UTS OPUS
Zhang, C., Li, C. & Zhang, Z. 2002, 'An Agent-Based Framework for Petroleum Information Services from Distributed Heterogeneous Data Resources', Ninth Asia-Pacific Software Engineering Conference APSEC 2002, Asia-Pacific Software Engineering Conference, IEEE Computer Society, Gold Coast, Australia, pp. 593-602.
View/Download from: UTS OPUS or Publisher's site
To make good decisions in petroleum production, it is becoming a major problem how to gather sufficient and correct information in a timely manner, as the information may be stored in databases, data files, or on the World Wide Web. In this paper, the Gaia methodology and the Open Agent Architecture were employed to contribute a framework that solves the above problem. The framework consists of three levels, namely role model, agent type, and agent instance. A model with five roles is analyzed, four agent types are designed, and six agent instances are developed to construct the petroleum information services system. The experimental results show that all agents in the system can work cooperatively to organize and retrieve relevant petroleum information. The successful implementation of the framework shows that agent-based technology can significantly facilitate the construction of complex systems in distributed, heterogeneous data resource environments.
Zhang, C., Zhang, S., Yan, X. & Qin, Z. 2002, 'Mind the Trends when you Mine: Incremental Data Mining', Proceedings of the 1st International Conference on Fuzzy Systems and Knowledge Discovery, International Conference on Fuzzy Systems and Knowledge, Nanyang Technological University, Singapore, pp. 476-480.
View/Download from: UTS OPUS
Zhang, Z. & Zhang, C. 2002, 'An Agent-Based Hybrid Intelligent System for Financial Investment Planning', PRICAI 2002: Trends in Artificial Intelligence, Proceeding of 7th Pacific Rim International Conference on Artificial Intelligence, Pacific Rim International Conference on Artificial Intelligence, Springer-Verlag, Tokyo, Japan, pp. 355-364.
View/Download from: UTS OPUS
Zhang, Z. & Zhang, C. 2002, 'An Application of Fuzzy Cluster Analysis to Matchmaking in Middle Agents', Proceedings of the 1st International Conference on Fuzzy Systems and Knowledge Discovery, International Conference on Fuzzy Systems and Knowledge, Nanyang Technological University, Singapore, pp. 566-570.
View/Download from: UTS OPUS
Chen, Q. & Zhang, C. 2002, 'Using ENDL to verify Cardholder Registration in SET Protocol', Proceedings of the International Conference on E-Business, International Conference on E-Business, Beijing Institute of Technology Press, Beijing, China, pp. 616-623.
View/Download from: UTS OPUS
Li, C., Zhang, C. & Zhang, Z. 2002, 'An Agent-Based Intelligent System for Information Gathering from World Wide Web Environment', Proceedings of 2002 International Conference on Machine Learning and Cybernetics, International Conference on Machine Learning and Cybernetics, IEEE Press, Beijing, China, pp. 1852-1857.
View/Download from: UTS OPUS or Publisher's site
Using the vast amount of information on Web sites efficiently and effectively is very important for making informed decisions. There are, however, still many problems to overcome in information gathering research before the relevant information required by users can be delivered. In this paper, an information gathering system is developed by means of multiple agents to solve those problems. We employed ideas from the Gaia methodology and an open agent architecture to analyze and design the system. The system consists of a query preprocessing agent, an information retrieval agent, an information filtering agent, and an information management agent. The filtering agent is trained with categorized documents and can provide users with the necessary information. The experimental results show that all agents in the system can work cooperatively to retrieve relevant information from the World Wide Web.
Wu, X., Zhang, C. & Zhang, S. 2002, 'Mining Both Positive and Negative Association Rules', Proceedings of the 19th International Conference on Machine Learning, 19th International Conference on Machine Learning, Morgan Kaufmann, Sydney, Australia, pp. 658-665.
View/Download from: UTS OPUS
Yan, X., Zhang, C., Zhang, S. & Debenham, J.K. 2002, 'Association Rule Mining by Agents', Proceedings of International Conference on Machine Learning and Applications, International Conference on Machine Learning and Applications, CSREA Press, Las Vegas, USA, pp. 77-83.
View/Download from: UTS OPUS
Zhang, C., Zhang, S., Yan, X. & Qin, Z. 2002, 'Identifying exceptional patterns in multi-databases', Proceedings of the 1st International Conference on Fuzzy Systems and Knowledge Discovery, International Conference on Fuzzy Systems and Knowledge, Nanyang Technological University, Singapore, pp. 146-150.
View/Download from: UTS OPUS

Journal articles

Wu, J., Hong, Z., Pan, S., Zhu, X., Cai, Z. & Zhang, C. 2016, 'Multi-graph-view subgraph mining for graph classification', Knowledge and Information Systems.
View/Download from: UTS OPUS or Publisher's site
© 2015 Springer-Verlag London. In this paper, we formulate a new multi-graph-view learning task, where each object to be classified contains graphs from multiple graph-views. This problem setting is essentially different from traditional single-graph-view graph classification, where graphs are collected from one single-feature view. To solve the problem, we propose a cross graph-view subgraph feature-based learning algorithm that explores an optimal set of subgraphs, across multiple graph-views, as features to represent graphs. Specifically, we derive an evaluation criterion to estimate the discriminative power and redundancy of subgraph features across all views, with a branch-and-bound algorithm being proposed to prune the subgraph search space. Because graph-views may complement each other and play different roles in a learning task, we assign each view a weight value indicating its importance to the learning task and further use an optimization process to find optimal weight values for each graph-view. The iteration between cross graph-view subgraph scoring and graph-view weight updating forms a closed loop to find optimal subgraphs to represent graphs for multi-graph-view learning. Experiments and comparisons on real-world tasks demonstrate the algorithm's superior performance.
Wu, J., Pan, S., Zhu, X., Zhang, P. & Zhang, C. 2016, 'SODE: Self-Adaptive One-Dependence Estimators for classification', Pattern Recognition, vol. 51, pp. 358-377.
View/Download from: UTS OPUS or Publisher's site
© 2015 Elsevier Ltd. SuperParent-One-Dependence Estimators (SPODEs) represent a family of semi-naive Bayesian classifiers which relax the attribute independence assumption of Naive Bayes (NB) to allow each attribute to depend on a common single attribute (the superparent). SPODEs can effectively handle data with attribute dependency but still inherit NB's key advantages, such as computational efficiency and robustness for high-dimensional data. In reality, determining an optimal superparent for SPODEs is difficult. One common approach is to use weighted combinations of multiple SPODEs, each having a different superparent with a properly assigned weight value (i.e., a weight value is assigned to each attribute). In this paper, we propose a self-adaptive SPODE, namely SODE, which uses immunity theory from artificial immune systems to automatically and self-adaptively select the weight for each single SPODE. SODE needs to know neither the importance of individual SPODEs nor the relevance among SPODEs, and can flexibly and efficiently search for optimal weight values for each SPODE during the learning process. Extensive experiments and comparisons on 56 benchmark data sets, and validations on image and text classification, demonstrate that SODE outperforms state-of-the-art weighted SPODE algorithms and is suitable for a wide range of learning tasks. Results also confirm that SODE provides an appropriate balance between runtime efficiency and accuracy.
Pan, S., Wu, J., Zhu, X., Zhang, C. & Yu, P.S. 2016, 'Joint Structure Feature Exploration and Regularization for Multi-Task Graph Classification', IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, vol. 28, no. 3, pp. 715-728.
View/Download from: Publisher's site
Zhang, P., He, J., Long, G., Huang, G. & Zhang, C. 2016, 'Towards anomalous diffusion sources detection in a large network', ACM Transactions on Internet Technology, vol. 16, no. 1.
View/Download from: UTS OPUS or Publisher's site
© 2016 ACM. Witnessing the wide spread of malicious information in large networks, we develop an efficient method to detect anomalous diffusion sources and thus protect networks from security and privacy attacks. To date, most existing work on diffusion source detection is based on the assumption that network snapshots reflecting information diffusion can be obtained continuously. However, obtaining snapshots of an entire network requires deploying detectors on all network nodes and is thus very expensive. Alternatively, in this article, we study the diffusion source locating problem by learning from information diffusion data collected from only a small subset of network nodes. Specifically, we present a new regression learning model that can detect anomalous diffusion sources by jointly solving five challenges, that is, an unknown number of source nodes, few activated detectors, unknown initial propagation time, uncertain propagation paths and uncertain propagation time delays. We theoretically analyze the strength of the model and derive performance bounds. We empirically test and compare the model using both synthetic and real-world networks to demonstrate its performance.
Pan, S., Wu, J., Zhu, X., Long, G. & Zhang, C. 2016, 'Boosting for graph classification with universum', Knowledge and Information Systems, pp. 1-25.
View/Download from: UTS OPUS or Publisher's site
© 2016 Springer-Verlag London. Recent years have witnessed extensive studies of graph classification due to the rapid increase in applications involving structural data and complex relationships. To support graph classification, all existing methods require that training graphs be relevant (or belong) to the target class, but cannot integrate graphs irrelevant to the class of interest into the learning process. In this paper, we study a new universum graph classification framework which leverages additional 'non-example' graphs to help improve the graph classification accuracy. We argue that although universum graphs do not belong to the target class, they may contain meaningful structure patterns that help enrich the feature space for graph representation and classification. To support universum graph classification, we propose a mathematical programming algorithm, ugBoost, which integrates discriminative subgraph selection and margin maximization into a unified framework to fully exploit the universum. Because informative subgraph exploration in a universum setting requires the search of a large space, we derive an upper-bound discriminative score for each subgraph and employ a branch-and-bound scheme to prune the search space. By using the explored subgraphs, our graph classification model intends to maximize the margin between positive and negative graphs and simultaneously minimize the loss on the universum graph examples. The subgraph exploration and the learning are integrated and performed iteratively so that each can be beneficial to the other. Experimental results and comparisons on real-world datasets demonstrate the performance of our algorithm.
Wu, J., Pan, S., Zhu, X., Zhang, C. & Wu, X. 2016, 'Positive and Unlabeled Multi-Graph Learning', IEEE Transactions on Cybernetics.
View/Download from: Publisher's site
In this paper, we advance graph classification to handle multi-graph learning for complicated objects, where each object is represented as a bag of graphs and the label is only available for each bag but not for individual graphs. In addition, when training classifiers, users are only given a handful of positive bags and many unlabeled bags, and the learning objective is to train models that classify previously unseen graph bags with maximum accuracy. To achieve this goal, we propose a positive and unlabeled multi-graph learning (puMGL) framework that first selects informative subgraphs to convert graphs into a feature space. To utilize unlabeled bags for learning, puMGL assigns a confidence weight to each bag and dynamically adjusts its weight value to select 'reliable' negative bags. A number of representative graphs, selected from the positive bags and the identified reliable negative graph bags, form a 'margin graph pool' which serves as the base for deriving subgraph patterns, training graph classifiers, and further updating the bag weight values. A closed-loop iterative process helps discover optimal subgraphs from positive and unlabeled graph bags for learning. Experimental comparisons demonstrate the performance of puMGL for classifying real-world complicated objects.
Pan, S., Wu, J., Zhu, X., Long, G. & Zhang, C. 2016, 'Task Sensitive Feature Exploration and Learning for Multitask Graph Classification', IEEE Transactions on Cybernetics.
View/Download from: Publisher's site
Multitask learning (MTL) is commonly used for jointly optimizing multiple learning tasks. To date, all existing MTL methods have been designed for tasks with feature-vector represented instances, but cannot be applied to structured data, such as graphs. More importantly, when carrying out MTL, existing methods mainly focus on exploring overall commonality or disparity between tasks for learning, but cannot explicitly capture task relationships in the feature space, so they are unable to answer important questions such as: what exactly is shared between tasks, and what is the uniqueness of one task that differs from the others? In this paper, we formulate a new multitask graph learning problem, and propose a task-sensitive feature exploration and learning algorithm for multitask graph classification. Because graphs do not have features readily available, we advocate a task-sensitive feature exploration and learning paradigm to jointly discover discriminative subgraph features across different tasks. In addition, a feature learning process is carried out to categorize each subgraph feature into one of three categories: 1) common feature; 2) task-auxiliary feature; and 3) task-specific feature, indicating whether the feature is shared by all tasks, by a subset of tasks, or by only one specific task, respectively. The feature learning and the multiple-task learning are iteratively optimized to form a multitask graph classification model with a global optimization goal. Experiments on real-world functional brain analysis and chemical compound categorization demonstrate the algorithm's performance. Results confirm that our method can be used to explicitly capture task correlations and uniqueness in the feature space, and to explicitly answer what is shared between tasks and what is unique to a specific task.
Zhang, Q., Zhang, P., Long, G., Ding, W., Zhang, C. & Wu, X. 2016, 'Online learning from trapezoidal data streams', IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 10, pp. 2709-2723.
View/Download from: Publisher's site
© 1989-2012 IEEE. In this paper, we study a new problem of continuous learning from doubly-streaming data, where both data volume and feature space increase over time. We refer to the doubly-streaming data as trapezoidal data streams and the corresponding learning problem as online learning from trapezoidal data streams. The problem is challenging because both data volume and data dimension increase over time, and existing online learning [1], [2], online feature selection [3], and streaming feature selection algorithms [4], [5] are inapplicable. We propose a new Online Learning with Streaming Features algorithm (OLSF for short) and its two variants, which combine online learning [1], [2] and streaming feature selection [4], [5] to enable learning from trapezoidal data streams with infinite training instances and features. When a new training instance carrying new features arrives, a classifier updates the existing features by following the passive-aggressive update rule [2] and updates the new features by following the structural risk minimization principle. Feature sparsity is then introduced by using the projected truncation technique. We derive performance bounds of the OLSF algorithm and its variants. We also conduct experiments on real-world data sets to show the performance of the proposed algorithms.
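The update rule described above can be sketched in a few lines. This is a minimal illustration of the trapezoidal-stream idea only, not the paper's OLSF implementation: here both existing and newly arrived feature dimensions share one passive-aggressive (PA-I) step, and the separate structural-risk update for new features and the sparsity truncation are omitted. All names below are ours.

```python
import numpy as np

def olsf_update(w, x, y, C=1.0):
    """One update on a trapezoidal stream: the incoming instance x may
    carry more features than the current weight vector w. The weight
    vector is grown to match, then updated with a PA-I step."""
    w = np.concatenate([w, np.zeros(len(x) - len(w))])  # grow to the new dimension
    loss = max(0.0, 1.0 - y * np.dot(w, x))             # hinge loss
    if loss > 0:
        tau = min(C, loss / np.dot(x, x))               # PA-I step size
        w = w + tau * y * x
    return w

# toy stream whose feature space grows from 2 to 3 dimensions
w = np.zeros(2)
w = olsf_update(w, np.array([1.0, -1.0]), +1)
w = olsf_update(w, np.array([0.5, 0.2, 1.0]), -1)       # a new 3rd feature arrives
```

After the second update the model has silently expanded to three weights and has moved against the negatively labeled instance.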
Wu, J., Pan, S., Zhu, X., Cai, Z., Zhang, P. & Zhang, C. 2015, 'Self-adaptive attribute weighting for Naive Bayes classification', EXPERT SYSTEMS WITH APPLICATIONS, vol. 42, no. 3, pp. 1487-1502.
View/Download from: UTS OPUS or Publisher's site
Pan, S., Wu, J., Zhu, X. & Zhang, C. 2015, 'Graph Ensemble Boosting for Imbalanced Noisy Graph Stream Classification', IEEE TRANSACTIONS ON CYBERNETICS, vol. 45, no. 5, pp. 940-954.
View/Download from: UTS OPUS or Publisher's site
Liu, H., Zhang, J., Ngo, H.H., Guo, W., Wu, H., Cheng, C., Guo, Z. & Zhang, C. 2015, 'Carbohydrate-based activated carbon with high surface acidity and basicity for nickel removal from synthetic wastewater', RSC Advances, vol. 5, no. 64, pp. 52048-52056.
View/Download from: UTS OPUS or Publisher's site
© The Royal Society of Chemistry. The feasibility of preparing activated carbon (AC-CHs) from carbohydrates (glucose, sucrose and starch) with phosphoric acid activation was evaluated by comparing its physicochemical properties and Ni(II) adsorption performance with a reference activated carbon (AC-PA) derived from Phragmites australis. The textural and chemical properties of the prepared activated carbon were characterized by N₂ adsorption/desorption isotherms, SEM, Boehm's titration and XPS. Although the AC-CHs had a much lower surface area (less than 700 m² g⁻¹) than AC-PA (1057 m² g⁻¹), they exhibited 45-70% larger Ni(II) adsorption capacity, which could be mainly attributed to their 50-75% higher contents of total acidic and basic groups. The comparison of XPS analyses for starch-based activated carbon before and after Ni(II) adsorption indicated that the Ni(II) cation combined with the oxygen-containing groups and basic groups (delocalized π-electrons) through the mechanisms of proton exchange, electrostatic attraction, and surface complexation. Kinetic results suggested that chemical reaction was the main rate-controlling step, and the AC-CHs showed very quick Ni(II) adsorption, with 95% of maximum adsorption within 30 min. Both the adsorption capacity and rate of the activated carbon depended on the surface chemistry, as revealed by batch adsorption experiments and XPS analyses. This study demonstrated that AC-CHs could be promising materials for Ni(II) pollution minimization.
Pan, S., Wu, J., Zhu, X., Long, G. & Zhang, C. 2015, 'Finding the best not the most: regularized loss minimization subgraph selection for graph classification', PATTERN RECOGNITION, vol. 48, no. 11, pp. 3783-3796.
View/Download from: UTS OPUS or Publisher's site
Yin, H., Cui, B., Chen, L., Hu, Z. & Zhang, C. 2015, 'Modeling Location-Based User Rating Profiles for Personalized Recommendation', ACM Transactions on Knowledge Discovery from Data, vol. 9, no. 3, pp. 1-41.
View/Download from: Publisher's site
Fang, M., Yin, J., Zhu, X. & Zhang, C. 2015, 'TrGraph: Cross-Network Transfer Learning via Common Signature Subgraphs', IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 9, pp. 2536-2549.
View/Download from: UTS OPUS or Publisher's site
Bin Li, Xingquan Zhu, Ruijiang Li & Chengqi Zhang 2015, 'Rating Knowledge Sharing in Cross-Domain Collaborative Filtering', IEEE Transactions on Cybernetics, vol. 45, no. 5, pp. 1068-1082.
View/Download from: UTS OPUS or Publisher's site
Chen, Q., Luo, H., Zhang, C. & Chen, Y.P. 2015, 'Bioinformatics in protein kinases regulatory network and drug discovery.', Mathematical biosciences, vol. 262, pp. 147-156.
Protein kinases have been implicated in a number of diseases, as kinases participate in many aspects of the control of cell growth, movement and death. Deregulated kinase activities and knowledge of these disorders are of great clinical interest for drug discovery. The most critical issue is the development of safe and efficient disease diagnosis and treatment at lower cost and in less time, and it is critical to develop innovative approaches that aim at the root cause of a disease, not just its symptoms. Bioinformatics, encompassing genetic, genomic, mathematical and computational technologies, has become the most promising option for effective drug discovery, and has shown its potential in the early stages of drug-target identification and target validation. It is essential that these aspects are understood and integrated into new methods used in drug discovery for diseases arising from deregulated kinase activity. This article reviews bioinformatics techniques for protein kinase data management and analysis, kinase pathways and drug targets, and describes their potential application in the pharmaceutical industry.
Liang, G., Zhu, X. & Zhang, C. 2014, 'The effect of varying levels of class distribution on bagging for different algorithms: An empirical study', International Journal of Machine Learning and Cybernetics, vol. 5, no. 1, pp. 63-71.
View/Download from: UTS OPUS or Publisher's site
Many real world applications involve highly imbalanced class distribution. Research into learning from imbalanced class distribution is considered to be one of ten challenging problems in data mining research, and it has increasingly captured the attention of both academia and industry. In this work, we study the effects of different levels of imbalanced class distribution on bagging predictors by using under-sampling techniques. Despite the popularity of bagging in many real-world applications, some questions have not been clearly answered in the existing research, such as the effect of varying the levels of class distribution on different bagging predictors, e.g., whether bagging is superior to single learners when the levels of class distribution change. Most classification learning algorithms are designed to maximize the overall accuracy rate and assume that training instances are uniformly distributed; however, the overall accuracy does not represent correct prediction on the minority class, which is the class of interest to users. The overall accuracy metric is therefore ineffective for evaluating the performance of classifiers in extremely imbalanced data. This study investigates the effect of varying levels of class distribution on different bagging predictors based on the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) as a performance metric, using an under-sampling technique on 14 data-sets with imbalanced class distributions. Our experimental results indicate that Decision Table (DTable) and RepTree are the learning algorithms with the best bagging AUC performance. The AUC performances of bagging predictors are statistically superior to single learners, with the exception of Support Vector Machines (SVM) and Decision Stump (DStump).
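The under-sampled bagging scheme studied above can be sketched as follows. This is a hedged toy version: each base learner sees all minority examples plus an equal-sized random draw of the majority class, and the base learner is a simple mean-midpoint stump standing in for the DTable/RepTree learners used in the paper; the function names and 1-D data are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample_bag(X, y, n_estimators=11):
    """Train n_estimators balanced base learners: each training set is
    all minority examples (y == 1) plus an equal-sized bootstrap draw
    from the majority class. Each toy stump stores a threshold at the
    midpoint of the two class means."""
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    stumps = []
    for _ in range(n_estimators):
        maj = rng.choice(majority, size=len(minority), replace=True)
        idx = np.concatenate([minority, maj])
        Xi, yi = X[idx], y[idx]
        stumps.append((Xi[yi == 1].mean() + Xi[yi == 0].mean()) / 2)
    return stumps

def minority_score(stumps, X):
    # fraction of stumps voting 'minority': a ranking score usable for AUC
    return np.mean([(X > t) for t in stumps], axis=0)

# imbalanced 1-D toy data: 95 majority points around 0, 5 minority around 3
X = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(3.0, 1.0, 5)])
y = np.concatenate([np.zeros(95, dtype=int), np.ones(5, dtype=int)])
scores = minority_score(undersample_bag(X, y), X)
```

Because every bootstrap sample is balanced, the ensemble's vote fraction ranks minority examples above majority ones on average, which is what an AUC-based evaluation measures.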
Fu, Y., Li, B., Zhu, X. & Zhang, C. 2014, 'Active Learning without Knowing Individual Instance Labels: A Pairwise Label Homogeneity Query Approach', IEEE Transactions On Knowledge And Data Engineering, vol. 26, no. 4, pp. 808-822.
View/Download from: UTS OPUS or Publisher's site
Traditional active learning methods require the labeler to provide a class label for each queried instance. The labelers are normally highly skilled domain experts to ensure the correctness of the provided labels, which in turn results in expensive labeling cost. To reduce labeling cost, an alternative solution is to allow nonexpert labelers to carry out the labeling task without explicitly telling the class label of each queried instance. In this paper, we propose a new active learning paradigm, in which a nonexpert labeler is only asked whether a pair of instances belong to the same class, namely, a pairwise label homogeneity. Under such circumstances, our active learning goal is twofold: (1) decide which pair of instances should be selected for query, and (2) how to make use of the pairwise homogeneity information to improve the active learner. To achieve the goal, we propose a Pairwise Query on Max-flow Paths strategy to query pairwise label homogeneity from a nonexpert labeler, whose query results are further used to dynamically update a Min-cut model (to differentiate instances in different classes). In addition, a Confidence-based Data Selection measure is used to evaluate data utility based on the Min-cut model's prediction results. The selected instances, with inferred class labels, are included into the labeled set to form a closed-loop active learning process. Experimental results and comparisons with state-of-the-art methods demonstrate that our new active learning paradigm can result in good performance with nonexpert labelers.
Wu, J., Zhu, X., Zhang, C. & Yu, P.S. 2014, 'Bag Constrained Structure Pattern Mining for Multi-Graph Classification', IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 10, pp. 2382-2396.
View/Download from: UTS OPUS or Publisher's site
Li, B., Chen, L., Zhu, X. & Zhang, C. 2013, 'Noisy but Non-malicious User Detection in Social Recommender Systems', World Wide Web, vol. 16, no. 5-6, pp. 677-699.
View/Download from: UTS OPUS or Publisher's site
Social recommender systems largely rely on user-contributed data to infer users' preferences. While this feature has enabled many interesting applications in social networking services, it also introduces unreliability to recommenders, as users are allowed to insert data freely. Although detecting malicious attacks from social spammers has been studied for years, little work has been done on detecting Noisy but Non-Malicious Users (NNMUs), that is, genuine users who may provide some untruthful data due to their imperfect behaviors. Unlike colluded malicious attacks that can be detected by finding similarly-behaved user profiles, NNMUs are more difficult to identify since their profiles are neither similar nor correlated with one another. In this article, we study how to detect NNMUs in social recommender systems. Based on the assumption that the ratings provided by the same user on closely correlated items should have similar scores, we propose an effective method for NNMU detection by capturing and accumulating a user's 'self-contradictions', i.e., the cases in which a user provides very different rating scores on closely correlated items. We show that self-contradiction capturing can be formulated as a constrained quadratic optimization problem w.r.t. a set of slack variables, which can be further used to quantify the underlying noise in each test user profile. We adopt three real-world data sets to empirically test the proposed method. The experimental results show that our method (i) is effective in real-world NNMU detection scenarios, (ii) can significantly outperform other noisy-user detection methods, and (iii) can improve recommendation performance for other users after removing detected NNMUs from the recommender system.
Li, J., Bian, W., Tao, D. & Zhang, C. 2013, 'Learning Colours From Textures By Sparse Manifold Embedding', Signal Processing, vol. 93, no. 6, pp. 1485-1495.
View/Download from: UTS OPUS or Publisher's site
The capability of inferring colours from the texture (grayscale contents) of an image is useful in many application areas, when the imaging device/environment is limited. Traditional manual or limited automatic colour assignment involves intensive human
Zhu, X., Yu, Y., Ou, Y., Luo, D., Zhang, C. & Chen, J. 2013, 'System modeling of a smart-home healthy lifestyle assistant', Lecture Notes in Computer Science, vol. 7607, no. 1, pp. 65-78.
View/Download from: UTS OPUS or Publisher's site
A system model is presented for a Smart-home Healthy Lifestyle Assistant System (SHLAS), covering healthy lifestyle promotion by intelligently collecting and analyzing context information, executing control instructions and suggesting health plans for
Qin, Z., Wang, T., Zhang, C. & Zhang, S. 2013, 'Cost-sensitive classification with k-nearest neighbors', Lecture Notes in Computer Science, vol. 8041, pp. 112-131.
View/Download from: UTS OPUS or Publisher's site
Cost-sensitive learning algorithms are typically motivated by imbalanced data in clinical diagnosis that contain skewed class distributions. While other popular classification methods have been improved for imbalanced data, it remains unsolved how to extend k-Nearest Neighbors (kNN) classification, one of the top-10 data mining algorithms, to make it cost-sensitive to imbalanced data. To fill this gap, in this paper we study two simple yet effective cost-sensitive kNN classification approaches, called Direct-CS-kNN and Distance-CS-kNN. In addition, we utilize several strategies (i.e., smoothing, minimum-cost k value selection, feature selection and ensemble selection) to improve the performance of Direct-CS-kNN and Distance-CS-kNN. We conduct several groups of experiments on UCI datasets to evaluate their efficiency, and demonstrate that the proposed cost-sensitive kNN classification algorithms can significantly reduce misclassification cost, often by a large margin, and consistently outperform CS-4.5 with/without additional enhancements.
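The Direct-CS-kNN idea described above reduces to a minimum-expected-cost decision over kNN vote probabilities, which can be sketched as follows. This is an illustrative assumption-laden version, not the paper's implementation: the cost-matrix convention cost[i][j] (cost of predicting class i when the truth is class j) and all names are ours, and the paper's smoothing and k-selection enhancements are omitted.

```python
import numpy as np

def direct_cs_knn(X_train, y_train, x, k, cost):
    """Estimate class probabilities from the k nearest neighbours'
    votes, then predict the label with minimum expected cost."""
    d = np.linalg.norm(X_train - x, axis=1)
    nn = y_train[np.argsort(d)[:k]]
    classes = np.unique(y_train)
    p = np.array([(nn == c).mean() for c in classes])  # P(class | x) from votes
    return classes[np.argmin(cost @ p)]                # minimum expected cost

X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
y_train = np.array([0, 0, 0, 1, 1])
x = np.array([0.15])                                   # 3 of its 5 neighbours are class 0

plain = direct_cs_knn(X_train, y_train, x, 5, np.array([[0, 1], [1, 0]]))
costly_miss = direct_cs_knn(X_train, y_train, x, 5, np.array([[0, 10], [1, 0]]))
```

With symmetric costs the majority vote (class 0) wins, but when missing class 1 is ten times as costly the same neighbourhood yields the opposite, cost-sensitive prediction.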
Song, M., Tao, D., Chen, C., Bu, J., Luo, J. & Zhang, C. 2012, 'Probabilistic Exposure Fusion', IEEE Transactions On Image Processing, vol. 21, no. 1, pp. 341-357.
View/Download from: UTS OPUS or Publisher's site
The luminance of a natural scene is often of high dynamic range (HDR). In this paper, we propose a new scheme to handle HDR scenes by integrating locally adaptive scene detail capture and suppressing gradient reversals introduced by the local adaptation. The proposed scheme is novel for capturing an HDR scene by using a standard dynamic range (SDR) device and synthesizing an image suitable for SDR displays. In particular, we use an SDR capture device to record scene details (i.e., the visible contrasts and the scene gradients) in a series of SDR images with different exposure levels. Each SDR image responds to a fraction of the HDR and partially records scene details. With the captured SDR image series, we first calculate the image luminance levels, which maximize the visible contrasts, and then the scene gradients embedded in these images. Next, we synthesize an SDR image by using a probabilistic model that preserves the calculated image luminance levels and suppresses reversals in the image luminance gradients. The synthesized SDR image contains much more scene detail than any of the captured SDR images. Moreover, the proposed scheme also functions as the tone mapping of an HDR image to the SDR image, and it is superior to both global and local tone mapping operators. This is because global operators fail to preserve visual details when the contrast ratio of a scene is large, whereas local operators often produce halos in the synthesized SDR image. The proposed scheme does not require any human interaction or parameter tuning for different scenes. Subjective evaluations have shown that it is preferred over a number of existing approaches.
Wang, T., Qin, Z., Zhang, S. & Zhang, C. 2012, 'Cost-sensitive Classification with Deficient Labeled Data', Information Systems, vol. 37, no. 5, pp. 508-516.
View/Download from: UTS OPUS or Publisher's site
It is a practical and challenging issue to learn cost-sensitive models from datasets with few labeled and plentiful unlabeled data, because labeled data are sometimes very difficult, time-consuming and/or expensive to obtain. To solve this issue, in this paper we propose two classification strategies for learning cost-sensitive classifiers from training datasets with both labeled and unlabeled data, based on Expectation Maximization (EM). The first method, Direct-EM, uses EM to build a semi-supervised classifier, then directly computes the optimal class label for each test example using the class probabilities produced by the learning model. The second method, CS-EM, modifies EM by incorporating misclassification cost into the probability estimation process. We conducted extensive experiments to evaluate the efficiency; the results show that, when using only a small number of labeled training examples, CS-EM outperforms the other competing methods on the majority of the selected UCI data sets across different cost ratios, especially when the cost ratio is high.
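The Direct-EM strategy above can be sketched with a deliberately small model. This is a hedged illustration, not the paper's classifier: we fit a two-class 1-D Gaussian model semi-supervised with EM (soft responsibilities on unlabeled points, hard labels on labeled ones) and then decide by minimum expected cost; the Gaussian form, the cost[i][j] convention (cost of predicting i when the truth is j), and all names are our assumptions.

```python
import numpy as np

def direct_em(Xl, yl, Xu, cost, n_iter=20):
    """Semi-supervised EM on 1-D data with two Gaussian classes,
    returning a minimum-expected-cost predictor."""
    mu = np.array([Xl[yl == c].mean() for c in (0, 1)])
    sd = np.array([max(Xl[yl == c].std(), 1e-3) for c in (0, 1)])
    prior = np.array([(yl == c).mean() for c in (0, 1)])
    X = np.concatenate([Xl, Xu])
    for _ in range(n_iter):
        # E-step: posterior responsibilities for the unlabeled points
        like = np.exp(-0.5 * ((Xu[:, None] - mu) / sd) ** 2) / sd
        r = like * prior
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted updates from hard labels plus soft responsibilities
        for c in (0, 1):
            w = np.concatenate([(yl == c).astype(float), r[:, c]])
            mu[c] = (w * X).sum() / w.sum()
            sd[c] = max(np.sqrt((w * (X - mu[c]) ** 2).sum() / w.sum()), 1e-3)
            prior[c] = w.sum() / len(X)

    def predict(x):
        like = np.exp(-0.5 * ((x - mu) / sd) ** 2) / sd
        p = like * prior / (like * prior).sum()
        return int(np.argmin(cost @ p))   # minimum expected cost decision
    return predict

rng = np.random.default_rng(1)
Xl = np.array([0.0, 0.2, 2.0, 2.2])       # only four labeled points
yl = np.array([0, 0, 1, 1])
Xu = np.concatenate([rng.normal(0, 0.3, 50), rng.normal(2, 0.3, 50)])
fair = direct_em(Xl, yl, Xu, np.array([[0, 1], [1, 0]]))
averse = direct_em(Xl, yl, Xu, np.array([[0, 100], [1, 0]]))  # missing class 1 is costly
```

With symmetric costs the predictor follows the posterior, while the cost-averse matrix pushes an ambiguous midpoint toward the expensive-to-miss class.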
Liu, T.T., Lipnicki, D., Zhu, W., Tao, D., Zhang, C., Cui, Y., Jin, J., Sachdev, P. & Wen, W. 2012, 'Cortical Gyrification And Sulcal Spans In Early Stage Alzheimer'S Disease', PLoS One, vol. 7, no. 2, pp. 1-5.
View/Download from: UTS OPUS or Publisher's site
Alzheimer's disease (AD) is characterized by an insidious onset of progressive cerebral atrophy and cognitive decline. Previous research suggests that cortical folding and sulcal width are associated with cognitive function in elderly individuals…
Zhang, S., Chen, F., Wu, X., Zhang, C. & Wang, R. 2012, 'Mining bridging rules between conceptual clusters', Applied Intelligence, vol. 36, no. 1, pp. 108-118.
View/Download from: UTS OPUS or Publisher's site
Bridging rules take the antecedent and action from different conceptual clusters. They are distinguished from association rules (frequent itemsets) because (1) they can be generated by the infrequent itemsets that are pruned in association rule mining, and (2) they are measured by their importance including the distance between two conceptual clusters, whereas frequent itemsets are measured only by their support. In this paper, we first design two algorithms for mining bridging rules between clusters, and then propose two non-linear metrics to measure their interestingness. We evaluate these algorithms experimentally and demonstrate that our approach is promising.
Su, G., Ying, M. & Zhang, C. 2012, 'Session Communication and Integration', CoRR, vol. abs/1210.2125.
Cao, L., Zhang, H., Zhao, Y., Luo, D. & Zhang, C. 2011, 'Combined Mining: Discovering Informative Knowledge in Complex Data', IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 3, pp. 699-712.
View/Download from: UTS OPUS or Publisher's site
Enterprise data mining applications often involve complex data such as multiple large heterogeneous data sources, user preferences, and business impact. In such situations, a single method or one-step mining is often limited in discovering informative knowledge. It would also be very time and space consuming, if not impossible, to join relevant large data sources for mining patterns consisting of multiple aspects of information. It is crucial to develop effective approaches for mining patterns combining necessary information from multiple relevant business lines, catering for real business settings and decision-making actions rather than just providing a single line of patterns. The recent years have seen increasing efforts on mining more informative patterns, e.g., integrating frequent pattern mining with classifications to generate frequent pattern-based classifiers. Rather than presenting a specific algorithm, this paper builds on our existing works and proposes combined mining as a general approach to mining for informative patterns combining components from either multiple data sets or multiple features or by multiple methods on demand. We summarize general frameworks, paradigms, and basic processes for multifeature combined mining, multisource combined mining, and multimethod combined mining. Novel types of combined patterns, such as incremental cluster patterns, can result from such frameworks, which cannot be directly produced by the existing methods. A set of real-world case studies has been conducted to test the frameworks, with some of them briefed in this paper. They identify combined patterns for informing government debt prevention and improving government service objectives, which show the flexibility and instantiation capability of combined mining in discovering informative knowledge in complex data.
Zhao, Y., Cao, J., Zhang, C. & Zhang, S. 2011, 'Enhancing grid-density based clustering for high dimensional data', Journal of Systems and Software, vol. 84, no. 9, pp. 1524-1539.
View/Download from: UTS OPUS or Publisher's site
We propose an enhanced grid-density based approach for clustering high dimensional data. Our technique takes objects (or points) as atomic units in which the size requirement to cells is waived without losing clustering accuracy. For efficiency, a new partitioning is developed to make the number of cells smoothly adjustable; a concept of the ith-order neighbors is defined for avoiding considering the exponential number of neighboring cells; and a novel density compensation is proposed for improving the clustering accuracy and quality. We experimentally evaluate our approach and demonstrate that our algorithm significantly improves the clustering accuracy and quality.
Zhu, X., Ding, W., Yu, P. & Zhang, C. 2011, 'One-class learning and concept summarization for data streams', Knowledge And Information Systems, vol. 28, no. 3, pp. 523-553.
View/Download from: UTS OPUS or Publisher's site
In this paper, we formulate a new research problem of concept learning and summarization for one-class data streams. The main objectives are to (1) allow users to label instance groups, instead of single instances, as positive samples for learning, and (2) summarize concepts labeled by users over the whole stream. The use of batch labeling raises serious issues for stream-oriented concept learning and summarization, because a labeled instance group may contain non-positive samples and users may change their labeling interests at any time. As a result, the positive samples labeled by users over the whole stream may be inconsistent and contain multiple concepts. To resolve these issues, we propose a one-class learning and summarization (OCLS) framework with two major components.
Zhu, X., Li, B., Wu, X., He, D. & Zhang, C. 2011, 'CLAP: Collaborative pattern mining for distributed information systems', Decision Support Systems, vol. 52, no. 1, pp. 40-51.
View/Download from: UTS OPUS or Publisher's site
The purpose of data mining from distributed information systems is usually threefold: (1) identifying locally significant patterns in individual databases; (2) discovering emerging significant patterns after unifying distributed databases in a single view…
Yang, T., Kecman, V., Cao, L., Zhang, C. & Huang, J. 2011, 'Margin-Based Ensemble Classifier For Protein Fold Recognition', Expert Systems with Applications, vol. 38, no. 10, pp. 12348-12355.
View/Download from: UTS OPUS or Publisher's site
Recognition of protein folding patterns is an important step in protein structure and function predictions. The traditional sequence-similarity-based approach fails to yield convincing predictions when proteins have low sequence identities…
Qin, Y., Zhang, S. & Zhang, C. 2010, 'Combining Knn Imputation And Bootstrap Calibrated Empirical Likelihood For Incomplete Data Analysis', International Journal of Data Warehousing and Mining, vol. 6, no. 4, pp. 61-73.
View/Download from: UTS OPUS or Publisher's site
The k-nearest neighbor (kNN) imputation, as one of the most important research topics in incomplete data discovery, has been developed with great success on industrial data. However, it is difficult to obtain a mathematically valid and simple procedure…
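The basic kNN imputation idea can be sketched as below. This is a toy illustration of the general technique, not the paper's procedure (whose contribution concerns statistical validity, which this sketch does not address); all names are ours.

```python
import math

def knn_impute(record, complete, k=3):
    """Fill each missing (None) field of `record` with the average of that
    field over the k nearest complete records, measuring distance on the
    observed fields only. `complete` is a list of fully observed records."""
    obs = [j for j, v in enumerate(record) if v is not None]
    dist = lambda r: math.dist([record[j] for j in obs], [r[j] for j in obs])
    neighbours = sorted(complete, key=dist)[:k]
    return [v if v is not None
            else sum(r[j] for r in neighbours) / len(neighbours)
            for j, v in enumerate(record)]
```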
Cao, L., Zhao, Y., Zhang, H., Luo, D., Zhang, C. & Park, E. 2010, 'Flexible Frameworks For Actionable Knowledge Discovery', IEEE Transactions On Knowledge And Data Engineering, vol. 22, no. 9, pp. 1299-1312.
View/Download from: UTS OPUS or Publisher's site
Most data mining algorithms and tools stop at the mining and delivery of patterns satisfying expected technical interestingness. There are often many patterns mined, but business people either are not interested in them or do not know what follow-up actions…
Zhang, C., Yu, P. & Bill, D. 2010, 'Introduction to the Domain-Driven Data Mining Special Section', IEEE Transactions On Knowledge And Data Engineering, vol. 22, no. 6, pp. 753-754.
View/Download from: Publisher's site
In the last decade, data mining has emerged as one of the most dynamic and lively areas in information technology. Although many algorithms and techniques for data mining have been proposed, they either focus on domain-independent techniques or on very specific domain problems. A general requirement in bridging the gap between academia and business is to cater to general domain-related issues surrounding real-life applications, such as constraints, organizational factors, domain expert knowledge, domain adaptation, and operational knowledge. Unfortunately, these either have not been addressed, or have not been sufficiently addressed, in current data mining research and development. By common consent, experience seems to indicate that real-world data mining must, in the majority of cases, consider and involve the domain experts' role, domain knowledge, business intelligence, human intelligence, network intelligence, social intelligence, domain-specific constraints, as well as organizational factors and social issues in practice. However, it is difficult to merge the above domain factors with data mining models and processes. It is also challenging to discover knowledge that will support users in taking decision-making actions.
Qin, Y., Zhang, S., Zhu, X., Zhang, J. & Zhang, C. 2009, 'Estimating confidence intervals for structural differences between contrast groups with missing data', Expert Systems with Applications, vol. 36, no. 3, pp. 6431-6438.
View/Download from: UTS OPUS or Publisher's site
Difference detection is a practical and extremely useful task, for example when evaluating a new medicine B against a specified disease by comparing it to an old medicine A that has been used to treat the disease for many years. The datasets generated by applying A and B to the disease are called contrast groups, and the main differences between the groups are the mean and distribution differences, referred to as structural differences in this paper. However, contrast groups are only two samples obtained by limited applications or tests of A and B, and may contain missing values. Therefore, the differences derived from the groups are inevitably uncertain. In this paper, we propose a statistically sound approach for measuring this uncertainty by identifying the confidence intervals of structural differences between contrast groups. This method is designed mainly for applications whose exact data distributions are unknown a priori and whose data may contain missing values. We apply our approach to UCI datasets to illustrate its power as a new data mining technique for tasks such as distinguishing spam from non-spam emails, and benign breast cancer from malignant.
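A percentile-bootstrap confidence interval for the mean difference between two contrast groups is the simplest form of the idea above. The sketch below is ours and deliberately ignores the paper's handling of missing values and distributional differences.

```python
import random
import statistics

def bootstrap_mean_diff_ci(group_a, group_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(group_b) - mean(group_a).

    Resamples each group with replacement, records the mean difference of
    each resample, and reads off the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(group_a) for _ in group_a]
        rb = [rng.choice(group_b) for _ in group_b]
        diffs.append(statistics.fmean(rb) - statistics.fmean(ra))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

When the interval excludes zero, the structural (mean) difference between the groups is credited as real at the chosen confidence level.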
Zhang, Z., Yang, P.H., Wu, X. & Zhang, C. 2009, 'An Agent-Based Hybrid System for Microarray Data Analysis', IEEE Intelligent Systems, vol. 24, no. 5, pp. 53-63.
View/Download from: UTS OPUS or Publisher's site
This article reports our experience in agent-based hybrid construction for microarray data analysis. The contributions are twofold: We demonstrate that agent-based approaches are suitable for building hybrid systems in general, and that a genetic ensemble system is appropriate for microarray data analysis in particular. Created using an agent-based framework, this genetic ensemble system for microarray data analysis excels in both sample classification accuracy and gene selection reproducibility.
Zhang, H., Zhao, Y., Cao, L., Zhang, C. & Bohlscheid, H. 2009, 'Customer Activity Sequence Classification for Debt Prevention in Social Security', Journal Of Computer Science And Technology, vol. 24, no. 6, pp. 1000-1009.
View/Download from: UTS OPUS or Publisher's site
From a data mining perspective, sequence classification is to build a classifier using frequent sequential patterns. However, mining a complete set of sequential patterns on a large dataset can be extremely time-consuming, and the large number of patterns discovered also makes the pattern selection and classifier building very time-consuming. The fact is that, in sequence classification, it is much more important to discover discriminative patterns than a complete pattern set. In this paper, we propose a novel hierarchical algorithm to build sequential classifiers using discriminative sequential patterns. Firstly, we mine the sequential patterns which are most strongly correlated to each target class. In this step, an aggressive strategy is employed to select a small set of sequential patterns. Secondly, pattern pruning and a serial coverage test are performed on the mined patterns. The patterns that pass the serial test are used to build the sub-classifier at the first level of the final classifier. Thirdly, the training samples that cannot be covered are fed back to the sequential pattern mining stage with updated parameters. This process continues until predefined interestingness measure thresholds are reached, or all samples are covered. The patterns generated in each loop form the sub-classifier at each level of the final classifier. Within this framework, the searching space can be reduced dramatically while a good classification performance is achieved. The proposed algorithm is tested in a real-world business application for debt prevention in the social security area. The novel sequence classification algorithm shows its effectiveness and efficiency for predicting debt occurrences based on customer activity sequence data.
Ou, Y., Cao, L. & Zhang, C. 2009, 'Adaptive Anomaly Detection of Coupled Activity Sequences', The IEEE Intelligent Informatics Bulletin, vol. 10, no. 1, pp. 12-16.
View/Download from: UTS OPUS
Many real-life applications often involve multiple sequences which are coupled with each other. It is unreasonable either to study the coupled sequences separately or to simply merge them into one sequence, because information about their interacting relationships would be lost. Furthermore, such coupled sequences frequently undergo significant changes, which are likely to degrade the performance of a trained model. Taking the detection of abnormal trading activity patterns in stock markets as an example, this paper proposes a Hidden Markov Model-based approach to address the above two issues. Our approach is suitable for sequence analysis on multiple coupled sequences and can adapt to significant sequence changes automatically. Substantial experiments conducted on a real dataset show that our approach is effective.
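The single-sequence core of such an HMM-based detector is to score a sequence by its forward-algorithm log-likelihood and flag sequences whose per-step score falls below a threshold. The sketch below is our own minimal illustration: the HMM parameters and threshold are toy values, and the paper's coupling of multiple sequences and automatic adaptation are omitted.

```python
import math

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    via the scaled forward algorithm. pi: initial state probabilities,
    A: state transition matrix, B: emission matrix."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    loglik = 0.0
    for t in range(1, len(obs)):
        scale = sum(alpha)          # rescale to avoid underflow
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                 for j in range(n)]
    return loglik + math.log(sum(alpha))

def is_anomalous(obs, pi, A, B, threshold):
    """Flag a sequence whose per-step log-likelihood is below `threshold`."""
    return forward_loglik(obs, pi, A, B) / len(obs) < threshold
```

A sequence that keeps switching symbols against a "sticky" two-state model scores noticeably lower per step than a steady one, which is what the threshold picks up.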
Yan, X., Zhang, C. & Zhang, S. 2009, 'Genetic algorithm-based strategy for identifying association rules without specifying actual minimum support', Expert Systems with Applications, vol. 36, no. 2, pp. 3066-3076.
View/Download from: UTS OPUS or Publisher's site
We design a genetic algorithm-based strategy for identifying association rules without specifying an actual minimum support. In this approach, an elaborate encoding method is developed, and the relative confidence is used as the fitness function. With the genetic algorithm, a global search can be performed and system automation is achieved, because our model does not require the user-specified threshold of minimum support. Furthermore, we extend this strategy to cover quantitative association rule discovery. For efficiency, we design a generalized FP-tree to implement this algorithm. We experimentally evaluate our approach, and demonstrate that our algorithms significantly reduce the computation costs and generate only interesting association rules.
Qin, Y., Zhang, S., Zhu, X., Zhang, J. & Zhang, C. 2009, 'POP algorithm: Kernel-based imputation to treat missing values in knowledge discovery from databases', Expert Systems with Applications, vol. 36, no. 2, pp. 2794-2804.
View/Download from: UTS OPUS or Publisher's site
To fill in missing values, one solution is to use correlations between the attributes of the data. The problem is that it is difficult to identify relations within data containing missing values. Accordingly, in this paper we develop a kernel-based missing data imputation method. This approach aims at making an optimal inference on statistical parameters (mean, distribution function and quantile) after missing data are imputed, and we refer to it as the parameter optimization method (POP algorithm). We experimentally evaluate our approach, and demonstrate that our POP algorithm (random regression imputation) is much better than deterministic regression imputation in efficiency and in generating inferences on the above parameters.
Yan, X., Zhang, C. & Zhang, S. 2009, 'Confidence Metrics For Association Rule Mining', Applied Artificial Intelligence, vol. 23, no. 8, pp. 713-737.
View/Download from: UTS OPUS or Publisher's site
We propose a simple, novel, and yet effective confidence metric for measuring the interestingness of association rules. Unlike existing confidence measures, our metrics truly indicate the positively companionate correlations between frequent itemsets. Furthermore, some desired properties are derived for examining the goodness of confidence measures in terms of probabilistic significance. We systematically analyze our metrics and traditional ones, and demonstrate that our new metric captures the mainstream properties. Our approach will be useful for many association analysis tasks where one must provide actionable association rules and assist users in making quality decisions.
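As a rough illustration of why plain confidence can fail to indicate positive correlation, the sketch below compares it with a centered variant, conf(A → B) − supp(B), which is positive only when observing A raises the probability of B. The metric name, functions, and transaction data are ours, not the paper's.

```python
# Toy rule-interestingness computation over a list of transactions (sets).

def supp(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b, transactions):
    """Plain confidence of the rule a -> b: supp(a ∪ b) / supp(a)."""
    return supp(a | b, transactions) / supp(a, transactions)

def centered_confidence(a, b, transactions):
    """Positive iff observing `a` raises the probability of `b`."""
    return confidence(a, b, transactions) - supp(b, transactions)

transactions = [frozenset(t) for t in
                [{'bread', 'milk'}, {'bread', 'milk'}, {'bread', 'milk'},
                 {'bread'}, {'milk'}, {'beer'}]]
```

Here conf(bread → milk) = 0.75 looks strong on its own, but the centered value is only about 0.08 because milk is frequent anyway, while beer → milk comes out negative.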
Zhang, S., Wu, X., Zhang, C. & Lu, J. 2008, 'Computing the minimum-support for mining frequent patterns', Knowledge And Information Systems, vol. 15, no. 2, pp. 233-257.
View/Download from: UTS OPUS or Publisher's site
Frequent pattern mining is based on the assumption that users can specify the minimum support for mining their databases. It has been recognized that setting the minimum support is a difficult task for users, which can hinder the widespread application of these algorithms. In this paper we propose a computational strategy for identifying frequent itemsets, consisting of polynomial approximation and fuzzy estimation. More specifically, our algorithms (polynomial approximation and fuzzy estimation) automatically generate actual minimum supports (appropriate to the database to be mined) according to users' mining requirements. We experimentally examine the algorithms using different datasets, and demonstrate that our fuzzy estimation algorithm fittingly approximates actual minimum supports from the commonly used requirements.
Cao, L., Zhang, C. & Zhou, M. 2008, 'Engineering open complex agent systems: A case study', IEEE Transactions On Systems Man And Cybernetics Part C-Applications And Reviews, vol. 38, no. 4, pp. 483-496.
View/Download from: UTS OPUS or Publisher's site
Open complex agent systems (OCAS) are becoming increasingly important in constructing problem-solving systems for enterprise applications. They are challenging because they present very high system complexities involving human users and interactions…
Cao, L., Zhao, Y. & Zhang, C. 2008, 'Mining impact-targeted activity patterns in imbalanced data', IEEE Transactions On Knowledge And Data Engineering, vol. 20, no. 8, pp. 1053-1066.
View/Download from: UTS OPUS or Publisher's site
Impact-targeted activities are rare, but they may have a significant impact on society. For example, isolated terrorism activities may lead to a disastrous event, threatening national security. Similar issues can also be seen in many other areas.
Cao, L., Zhao, Y., Zhang, C. & Zhang, H. 2008, 'Activity mining: From activities to actions', International Journal Of Information Technology & Decision Making, vol. 7, no. 2, pp. 259-273.
View/Download from: UTS OPUS or Publisher's site
Activity data accumulated in real life, such as terrorist activities and governmental customer contacts, present special structural and semantic complexities. Activity data may lead to or be associated with significant business impacts, and result in important actions and decision making leading to business advantage. For instance, a series of terrorist activities may trigger a disaster to society, and large amounts of fraudulent activities in social security programs may result in huge government customer debt. Uncovering these activities or activity sequences can greatly evidence and/or enhance corresponding actions in business decisions. However, mining such data challenges the existing KDD research in aspects such as unbalanced data distribution and impact-targeted pattern mining. This paper investigates the characteristics and challenges of activity data, and the methodologies and tasks of activity mining based on case-study experience in the area of social security. Activity mining aims to discover high impact activity patterns in huge volumes of unbalanced activity transactions. Activity patterns identified can be used to prevent disastrous events or improve business decision making and processes. We illustrate the above issues and prospects in mining governmental customer contacts data to recover customer debt.
Zhang, S., Zhang, J., Zhu, X., Qin, Y. & Zhang, C. 2008, 'Missing Value Imputation Based on Data Clustering', Lecture Notes in Computer Science, vol. 4750, no. 2008, pp. 128-138.
View/Download from: UTS OPUS or Publisher's site
We propose an efficient nonparametric missing value imputation method based on clustering, called CMI (Clustering-based Missing value Imputation), for dealing with missing values in target attributes. In our approach, we impute the missing values of an instance A with plausible values generated, using a kernel-based method, from the data in the instances that contain no missing values and are most similar to instance A. Specifically, we first divide the dataset (including the instances with missing values) into clusters. Next, missing values of an instance A are patched up with the plausible values generated from A's cluster. Extensive experiments show the effectiveness of the proposed method in the missing value imputation task.
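A minimal sketch of the clustering-then-imputation idea follows. It is our own illustration: a tiny k-means stands in for the clustering step, and the paper's kernel-based value generation is replaced by a simple cluster mean.

```python
import math

def kmeans(points, k, iters=10):
    """Tiny k-means: first k points as initial centroids (toy sketch)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[i].append(p)
        for i, g in enumerate(groups):
            if g:   # recompute each centroid as the mean of its group
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def impute(record, complete, k=2):
    """Fill the None entries of `record` from the nearest cluster of
    complete records, using only the observed attributes for distance."""
    cents = kmeans(complete, k)
    obs = [j for j, v in enumerate(record) if v is not None]
    nearest = min(cents, key=lambda c: math.dist([record[j] for j in obs],
                                                 [c[j] for j in obs]))
    return [v if v is not None else nearest[j] for j, v in enumerate(record)]
```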
Zhang, H., Zhao, Y., Cao, L. & Zhang, C. 2007, 'Class Association Rule Mining with Multiple Imbalanced Attributes', Lecture Notes in Computer Science, vol. 4830, pp. 827-831.
View/Download from: UTS OPUS
In this paper, we propose a novel framework to deal with data imbalance in class association rule mining. In each class association rule, the right-hand side is a target class while the left-hand side may contain one or more attributes. This framework focuses on multiple imbalanced attributes on the left-hand side. In the proposed framework, the rules with and without imbalanced attributes are processed in parallel. The rules without imbalanced attributes are mined with a standard algorithm, while the rules with imbalanced attributes are mined based on newly defined measurements. Through a simple transformation, these measurements can be placed in a uniform space so that only a few parameters need to be specified by the user. In the case study, the proposed algorithm is applied in the social security field. Although some attributes are severely imbalanced, the rules with a minority of the imbalanced attributes have been mined efficiently.
Zhang, Z. & Zhang, C. 2007, 'Building agent-based hybrid intelligent system: A case study', Web Intelligence and Agent Systems-An international journal, vol. 5, no. 3, pp. 255-271.
View/Download from: UTS OPUS
Many complex problems (e.g., financial investment planning, foreign exchange trading, data mining from large/multiple databases) require hybrid intelligent systems that integrate many intelligent techniques (e.g., fuzzy logic, neural networks, and genetic algorithms). However, hybrid intelligent systems are difficult to develop because they have a large number of parts or components that have many interactions. On the other hand, agents offer a new and often more appropriate route to the development of complex systems, especially in open and dynamic environments. Thus, this paper discusses the development of an agent-based hybrid intelligent system for financial investment planning, in which a great number of heterogeneous computing techniques/packages are easily integrated into a unifying agent framework. This shows that agent technology can indeed facilitate the development of hybrid intelligent systems.
Yan, X., Zhang, S. & Zhang, C. 2007, 'On data structures for association rule discovery', Applied Artificial Intelligence, vol. 21, no. 2, pp. 57-79.
View/Download from: UTS OPUS or Publisher's site
Systematically we study data structures used to implement the algorithms of association rule mining, including hash tree, itemset tree, and FP-tree (frequent pattern tree). Further, we present a generalized FP-tree in an applied context. This assists in better understanding existing association-rule-mining strategies. In addition, we discuss and analyze experimentally the generalized k-FP-tree, and demonstrate that the generalized FP-tree reduces the computation costs significantly. This study will be useful to many association analysis tasks where one must provide really interesting rules and develop efficient algorithms for identifying association rules.
Zhang, S., Zhang, J. & Zhang, C. 2007, 'EDUA: An efficient algorithm for dynamic database mining', Information Sciences, vol. 177, no. 13, pp. 2756-2767.
View/Download from: UTS OPUS or Publisher's site
Maintaining frequent itemsets (patterns) is one of the most important issues faced by the data mining community. While many algorithms for pattern discovery have been developed, relatively little work has been reported on mining dynamic databases, a major area of application in this field. In this paper, a new algorithm, namely the Efficient Dynamic Database Updating Algorithm (EDUA), is designed for mining dynamic databases. It works well when data deletion is carried out in any subset of a database that is partitioned according to the arrival time of the data. A pruning technique is proposed for improving the efficiency of the EDUA algorithm. Extensive experiments are conducted to evaluate the proposed approach and it is demonstrated that the EDUA is efficient.
Cao, L., Luo, D. & Zhang, C. 2007, 'Knowledge actionability: satisfying technical and business interestingness', International Journal of Business Intelligence and Data Mining, vol. 2, no. 4, pp. 496-514.
View/Download from: UTS OPUS or Publisher's site
Traditionally, knowledge actionability has been investigated mainly by developing and improving technical interestingness. Recent initial work on technical subjective interestingness and business-oriented profit mining shows promise, but bridging the gap between technical significance and business expectation remains a long-term mission. In this paper, we propose a two-way significance framework for measuring knowledge actionability, which highlights both technical interestingness and domain-specific expectations. We further develop a fuzzy interestingness aggregation mechanism to generate a ranked final pattern set balancing technical and business interests. Real-life data mining applications show that the proposed knowledge actionability framework can complement technical interestingness while satisfying real user needs.
Chen, Q., Chen, P.Y. & Zhang, C. 2007, 'Detecting inconsistency in biological molecular databases using ontologies', Data Mining and Knowledge Discovery, vol. 15, no. 2, pp. 275-296.
View/Download from: UTS OPUS or Publisher's site
The rapid growth of life science databases demands the fusion of knowledge from heterogeneous databases to answer complex biological questions. The discrepancies in nomenclature, various schemas and incompatible formats of biological databases, however, result in a significant lack of interoperability among databases. Therefore, data preparation is a key prerequisite for biological database mining. Integrating diverse biological molecular databases is an essential action to cope with the heterogeneity of biological databases and guarantee efficient data mining. However, the inconsistency in biological databases is a key issue for data integration. This paper proposes a framework to detect the inconsistency in biological databases using ontologies. A numeric estimate is provided to measure the inconsistency and identify those biological databases that are appropriate for further mining applications. This aids in enhancing the quality of databases and guaranteeing accurate and efficient mining of biological databases.
Cao, L. & Zhang, C. 2007, 'The Evolution of KDD: Towards Domain-Driven Data Mining', International Journal of Pattern Recognition and Artificial Intelligence, vol. 21, no. 4, pp. 677-692.
View/Download from: UTS OPUS or Publisher's site
Traditionally, data mining is an autonomous data-driven trial-and-error process. Its typical task is to let data tell a story disclosing hidden information, in which domain intelligence may not be necessary beyond demonstrating an algorithm. Often the knowledge discovered is not generally interesting to business needs. By contrast, real-world applications rely on knowledge for taking effective actions. In retrospect of the evolution of KDD, this paper briefly introduces domain-driven data mining to complement traditional KDD. Domain intelligence is highlighted towards actionable knowledge discovery, which involves aspects such as domain knowledge, people, environment and evaluation. We illustrate it through mining activity patterns in social security data.
Qin, Y., Zhang, S., Zhu, X., Zhang, J. & Zhang, C. 2007, 'Semi-parametric optimization for missing data imputation', Applied Intelligence, vol. 27, no. 1, pp. 79-88.
View/Download from: UTS OPUS or Publisher's site
Missing data imputation is an important issue in machine learning and data mining. In this paper, we propose a new and efficient imputation method for a kind of missing data: semi-parametric data. Our imputation method aims at making an optimal evaluation with respect to Root Mean Square Error (RMSE), distribution function and quantile after missing data are imputed. We evaluate our approaches experimentally using both simulated and real data, and demonstrate that our stochastic semi-parametric regression imputation is much better than existing deterministic semi-parametric regression imputation in efficiency and effectiveness.
Zhang, C., Ying, M. & Qiao, B. 2006, 'Universal Programmable Devices For Unambiguous Discrimination', Physical Review A, vol. 74, no. 4, pp. 1-9.
View/Download from: UTS OPUS
We discuss the problem of designing unambiguous programmable discriminators for any n unknown quantum states in an m-dimensional Hilbert space. The discriminator is a fixed measurement that has two kinds of input registers: the program registers and the…
Cao, L., Zhang, C. & Liu, J. 2006, 'Ontology-Based Integration Of Business Intelligence', Web Intelligence and Agent Systems: An International Journal, vol. 4, no. 3, pp. 313-325.
View/Download from: UTS OPUS
The integration of Business Intelligence (BI) has been taken by business decision-makers as an effective means to enhance enterprise "soft power" and added value in the reconstruction and revolution of traditional industries. The existing solutions based on structural integration are to pack together data warehouse (DW), OLAP, data mining (DM) and reporting systems from different vendors. BI system users are finally delivered a reporting system in which reports, data models, dimensions and measures are predefined by system designers. As a result of a survey in the US, 85% of DW projects based on the above solutions failed to meet their intended objectives. In this paper, we summarize our investigation on the integration of BI on the basis of semantic integration and structural interaction. Ontology-based integration of BI is discussed for semantic interoperability in integrating DW, OLAP and DM. A hybrid ontological structure is introduced which includes a conceptual view, an analytical view and a physical view. These views are matched with user interfaces, DW and enterprise information systems, respectively. Relevant ontological engineering techniques are developed for ontology namespace, semantic relationships, and ontological transformation, mapping and query in this ontological space. The approach is promising for business-oriented, adaptive and automatic integration of BI in the real world. Operational decision-making experiments within a telecom company have demonstrated that a BI system utilizing the proposed approach is more flexible.
Cao, L. & Zhang, C. 2006, 'Domain-driven data mining: A practical methodology', International Journal of Data Warehousing and Mining, vol. 2, no. 4, pp. 49-65.
View/Download from: UTS OPUS
Extant data mining is based on data-driven methodologies. It either views data mining as an autonomous data-driven, trial-and-error process or analyzes business issues only in an isolated, case-by-case manner. As a result, the knowledge discovered is often not interesting to real business needs. Therefore, this article proposes a practical data mining methodology referred to as domain-driven data mining, which targets actionable knowledge discovery in a constrained environment for satisfying user preferences. Domain-driven data mining consists of a DDID-PD framework that considers key components such as constraint-based context, integration of domain knowledge, human-machine cooperation, in-depth mining, actionability enhancement, and an iterative refinement process. We also illustrate some examples of mining actionable correlations in the Australian Stock Exchange, which show that domain-driven data mining has the potential to further improve the actionability of patterns for practical use by industry and business.
Yan, X., Zhang, C. & Zhang, S. 2005, 'ARMGA: Identifying interesting association rules with genetic algorithms', Applied Artificial Intelligence, vol. 19, no. 7, pp. 677-689.
View/Download from: UTS OPUS or Publisher's site
Apriori-like algorithms for association rules mining have relied on two user-specified thresholds: minimum support and minimum confidence. There are two significant challenges to applying these algorithms to real-world applications: database-dependent min
Cheng, X., Ouyang, D., Jiang, Y. & Zhang, C. 2005, 'An improved model-based method to test circuit faults', Theoretical Computer Science, vol. 341, no. 1-3, pp. 150-161.
View/Download from: UTS OPUS or Publisher's site
This paper presents an improved model-based reasoning method to test circuit faults. The testing procedure is applicable even when the target system contains multiple faulty modes. Using our method, the observation could be planned appropriately to guara
Zhao, Y., Zhang, C. & Zhang, S. 2005, 'A recent-biased dimension reduction technique for time series data', Advances In Knowledge Discovery And Data Mining, Proceedings, vol. 3518, pp. 751-757.
View/Download from: UTS OPUS
There are many techniques developed for tackling time series and most of them consider every part of a sequence equally. In many applications, however, recent data can often be much more interesting and significant than old data. This paper defines new r
Yu, J.X., Ou, Y., Zhang, C. & Zhang, S. 2005, 'Identifying interesting visitors through Web log classification', IEEE Intelligent Systems, vol. 20, no. 3, pp. 55-59.
View/Download from: UTS OPUS or Publisher's site
Web site owners have trouble identifying customer purchasing patterns from their Web logs because the two aren't directly related. Thus, organizations must understand their customers' behavior, preferences, and future needs. This imperative leads many companies to develop a great many e-service systems for data collection and analysis. Web mining is a popular technique for analyzing visitor activities in e-service systems. It mainly includes Web text mining, Web structure mining and Web log mining. Our Web log mining approach classifies a particular site's visitors into different groups on the basis of their purchase interest.
Zhang, C., Chen, P.Y., Chen, Q. & Zhang, S. 2005, 'Mining Inconsistent Secure Messages Toward Analyzing Security Protocols', International Journal of Intelligent Control and Systems, vol. 10, no. 1, pp. 77-85.
Zhang, C., Yang, Q. & Liu, B. 2005, 'Guest Editors' Introduction: Special Section on Intelligent Data Preparation', IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, pp. 1163-1165.
Wu, X., Zhang, C. & Zhang, S. 2005, 'Database classification for multi-database mining', Information Systems, vol. 30, no. 1, pp. 71-88.
View/Download from: UTS OPUS or Publisher's site
Wang, J., Wu, X. & Zhang, C. 2005, 'Support vector machines based on K-means clustering for real-time business intelligence systems', Business Intelligence and Data Mining, vol. 1, no. 1, pp. 54-64.
View/Download from: UTS OPUS
Cao, L., Zhang, C. & Dai, R. 2005, 'The OSOAD Methodology for Open Complex Agent Systems', International Journal of Intelligent Control and Systems, vol. 10, no. 4, pp. 277-285.
View/Download from: UTS OPUS
Open complex agent systems (OCAS) are middle-size or large-scale open agent organizations. To engineer OCAS, agent-centric organization-oriented analysis, design and implementation, namely organization-oriented methodology (OOM), has emerged as a highly promising direction. A number of OOM-related approaches have been proposed, but some intrinsic issues remain hidden in them. For instance, some fundamental system attributes, such as system dynamics, are covered by almost none of the existing approaches. In this paper, we summarize our investigation of existing approaches and report a new OOM approach called OSOAD. The OSOAD approach consists of organizational abstraction (OA), organization-oriented analysis (OOA), agent service-oriented design (ASOD), and Java agent service-based implementation. OSOAD provides complete and deployable mechanisms for all software engineering phases. In particular, we highlight the transition supports from OA to OOA and ASOD. This approach has been built and deployed in the practical development of agent service-based financial trading and mining applications.
Cao, L., Zhang, C. & Dai, R. 2005, 'Organization-Oriented Analysis of Open Complex Agent Systems', International Journal of Intelligent Control and Systems, vol. 10, no. 2, pp. 114-122.
View/Download from: UTS OPUS
Organization-oriented analysis acts as the key step and foundation in building organization-oriented methodology (OOM) to engineer multi-agent systems, especially open complex agent systems (OCAS). A number of existing approaches target OOM, but they are incompatible with each other, and none of them is available as a solid and practical tool for engineering OCAS. This paper summarizes our investigation in building a unified framework for abstracting and analyzing OCAS organizations. Our organization-oriented framework, referred to as ORGANISED, integrates and expands existing approaches, and explicitly captures the main attributes in an OCAS. Following this framework, individual model-building blocks are developed for all ORGANISED members; both visual and formal specifications are utilized to present an intuitive and precise analysis. The above techniques have been deployed in developing an agent service-based trading and mining support infrastructure.
Wang, J., Zhang, C., Wu, X., Qi, H. & Wang, J. 2005, 'SVM-OD: SVM Method to Detect Outliers', Studies in Computational Intelligence, vol. 9, pp. 129-141.
View/Download from: UTS OPUS
Outlier detection is an important task in data mining because outliers can be either useful knowledge or noise. Many statistical methods have been applied to detect outliers, but they usually assume a given distribution of the data and have difficulty dealing with high-dimensional data. The Statistical Learning Theory (SLT) established by Vapnik et al. provides a new way to overcome these drawbacks. Based on SLT, Scholkopf et al. proposed the v-Support Vector Machine (v-SVM) and applied it to detect outliers. However, it is still difficult for data mining users to decide one key parameter in v-SVM. This paper proposes a new SVM method to detect outliers, SVM-OD, which can avoid this parameter. We provide theoretical analysis based on SLT as well as experiments to verify the effectiveness of our method. Moreover, an experiment on synthetic data shows that SVM-OD can detect some local outliers near a cluster with some distribution while v-SVM cannot.
Chen, Q., Zhang, C. & Zhang, S. 2005, 'ENDL: A logical framework for verifying secure transaction protocols', Knowledge And Information Systems, vol. 7, no. 1, pp. 84-109.
View/Download from: UTS OPUS or Publisher's site
This paper proposes a new logic for verifying secure transaction protocols. We have named this logic the ENDL (extension of non-monotonic dynamic logic). In this logic, timestamps and signed certificates are used for protecting against replays of old key
Zhang, S., Wu, X., Zhang, J. & Zhang, C. 2005, 'A decremental algorithm for maintaining frequent itemsets in dynamic databases', Data Warehousing And Knowledge Discovery, Proceedings, vol. 3589, pp. 305-314.
View/Download from: UTS OPUS
Data mining and machine learning must confront the problem of pattern maintenance because data updating is a fundamental operation in data management. Most existing data-mining algorithms assume that the database is static, and a database update requires
Qin, Z., Zhang, C., Xie, X. & Zhang, S. 2005, 'Dynamic test-sensitive decision trees with multiple cost scales', Fuzzy Systems And Knowledge Discovery, Pt 1, Proceedings, vol. 3613, pp. 402-405.
View/Download from: UTS OPUS
Previous work considering both test and misclassification costs relies on the assumption that the test cost and the misclassification cost must be defined on the same cost scale. However, it can be difficult to define the multiple costs on the same cost sc
Zhang, C., Zhang, Z. & Cao, L. 2005, 'Agents and data mining: Mutual enhancement by integration', Lecture Notes In Computer Science, vol. 3505, pp. 50-61.
View/Download from: UTS OPUS
This paper tells a story of synergism of two cutting edge technologies - agents and data mining. By integrating these two technologies, the power for each of them is enhanced. Integrating agents into data mining systems, or constructing data mining syste
Zhang, C., Qin, Z. & Yan, X. 2005, 'Association-based Segmentation for Chinese-crossed Query Expansion', The IEEE Intelligent Informatics Bulletin, vol. 5, no. 1, pp. 18-25.
View/Download from: UTS OPUS
Wu, X., Zhang, C. & Zhang, S. 2004, 'Efficient mining of both positive and negative association rules', ACM Transactions On Information Systems, vol. 22, no. 3, pp. 381-405.
View/Download from: UTS OPUS or Publisher's site
Zhang, S., Zhang, C. & Yu, J.X. 2004, 'Mining dependent patterns in probabilistic databases', Cybernetics And Systems, vol. 35, no. 4, pp. 399-424.
View/Download from: UTS OPUS or Publisher's site
Zhang, S., Lu, J. & Zhang, C. 2004, 'A fuzzy logic based method to acquire user threshold of minimum-support for mining association rules', Information Sciences, vol. 164, no. 1-4, pp. 1-16.
View/Download from: UTS OPUS or Publisher's site
Zhang, S., Zhang, C. & Yang, Q. 2004, 'Information Enhancement for data mining', IEEE Intelligent Systems, vol. 19, no. 2, pp. 12-13.
Zhang, C., Liu, M., Nie, W. & Zhang, S. 2004, 'Identifying Global Exceptional Patterns in multi-database mining', IEEE Computational Intelligence Bulletin, vol. 3, no. 1, pp. 19-24.
View/Download from: UTS OPUS
Yan, X., Zhang, C. & Zhang, S. 2004, 'Identifying software component association with genetic algorithm', International Journal Of Software Engineering And Knowledge Engineering, vol. 14, no. 4, pp. 441-447.
View/Download from: UTS OPUS or Publisher's site
Identifying software component association is useful for component management and component retrieval. In this paper we design an evolutionary strategy to better understand software structure and to identify software component association, using a genetic algorithm. Our mining strategy is effective for global search, especially when the search space is so large that it is hardly possible to use a deterministic search method.
Zhang, S., Zhang, C. & Yu, J.X. 2004, 'An efficient strategy for mining exceptions in multi-databases', Information Sciences, vol. 165, no. 1-2, pp. 1-20.
View/Download from: UTS OPUS or Publisher's site
Zhang, S., Zhang, C. & Yang, Q. 2003, 'Data preparation for data mining', Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 375-381.
View/Download from: UTS OPUS or Publisher's site
Data preparation is a fundamental stage of data analysis. While a lot of low-quality information is available in various data sources and on the Web, many organizations or companies are interested in how to transform the data into cleaned forms which can be used for high-profit purposes. This goal generates an urgent need for data analysis aimed at cleaning the raw data. In this paper, we first show the importance of data preparation in data analysis, then introduce some research achievements in the area of data preparation. Finally, we suggest some future directions of research and development.
Zhang, C., Zhang, S. & Zhang, Z. 2003, 'Temporal constraint satisfaction in matrix method', Applied Artificial Intelligence, vol. 17, no. 2, pp. 135-154.
View/Download from: UTS OPUS or Publisher's site
Yan, X., Zhang, C. & Zhang, S. 2003, 'Toward databases mining: Pre-processing collected data', Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 545-561.
View/Download from: UTS OPUS or Publisher's site
This paper presents a new means of selecting quality data for mining multiple data sources. Traditional data-mining strategies obtain necessary data from internal and external data sources and pool all the data into a huge homogeneous dataset for discovery. In contrast, our data-mining strategy identifies quality data from (internal and external) data sources for a mining task. A framework is advocated for generating quality data. Experimental results demonstrate that application of this new data collecting technique can not only identify quality data, but can also efficiently reduce the amount of data that must be considered during mining.
Zhang, S., Zhang, C. & Qin, Z. 2003, 'Modeling Temporal Semantics of Data', Asian Journal of Information Technology, vol. 2, no. 1, pp. 25-36.
View/Download from: UTS OPUS
Recently, many temporal query languages, such as TCAL and TQuel, have been proposed for temporal databases. However, these temporal query languages still have some limitations, such as inadequacy in operating on data with temporal elements and in handling the semantics of the time 'NOW'. After defining a new temporal relational algebra, in this paper we build a tuple calculus language based on gap-intervals for temporal databases. This tuple calculus is designed to support time queries, non-time queries, and general temporal queries. In particular, the semantics of the time 'NOW' is well implemented in this language, and the first temporal normal form of relations is closed under the extended operators in our temporal query language.
Zhang, S., Wu, X. & Zhang, C. 2003, 'Multi-database Mining', IEEE Computational Intelligence Bulletin, vol. 2, no. 1, pp. 5-13.
View/Download from: UTS OPUS
Zhang, C., Zhang, S. & Webb, G.I. 2003, 'Identifying Approximate Itemsets of Interest in Large Databases', Applied Intelligence, vol. 18, no. 1, pp. 91-104.
View/Download from: UTS OPUS or Publisher's site
Zhang, S. & Zhang, C. 2003, 'Discovering Associations in Very Large Databases by Approximating', Acta Cybernetica, vol. 16, pp. 155-177.
View/Download from: UTS OPUS
Zhang, Z., Zhang, C. & Zhang, S. 2003, 'An Agent-based Hybrid Framework for Database Mining', Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 383-398.
View/Download from: UTS OPUS or Publisher's site
Zhang, S., Zhang, C. & Yan, X. 2003, 'Post-mining: maintenance of association rules by weighting', Information Systems, vol. 28, no. 7, pp. 691-707.
View/Download from: UTS OPUS or Publisher's site
Li, Y., Zhang, C. & Zhang, S. 2003, 'Cooperative strategy for Web data mining and cleaning', Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 443-460.
View/Download from: UTS OPUS or Publisher's site
While the Internet and World Wide Web have put a huge volume of low-quality information at the easy access of an information gathering system, filtering out irrelevant information has become a big challenge. In this paper, a Web data mining and cleaning strategy for information gathering is proposed. A data-mining model is presented for the data that come from multiple agents. Using the model, a data-cleaning algorithm is then presented to eliminate irrelevant data. To evaluate the data-cleaning strategy, an interpretation is given for the mining model according to evidence theory. An experiment is also conducted to evaluate the strategy using Web data. The experimental results have shown that the proposed strategy is efficient and promising.
Zhang, S. & Zhang, C. 2003, 'A probabilistic data model and its semantics', Journal Of Research And Practice In Information Technology, vol. 35, no. 4, pp. 237-256.
View/Download from: UTS OPUS
Luo, X., Zhang, C. & Jennings, N. 2002, 'A Hybrid Model for Sharing Information between Fuzzy, Uncertain and Default Reasoning Models in Multi-agent systems', International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 4, pp. 401-450.
This paper develops a hybrid model which provides a unified framework for the following four kinds of reasoning: 1) Zadeh's fuzzy approximate reasoning; 2) truth-qualification uncertain reasoning with respect to fuzzy propositions; 3) fuzzy default reasoning (proposed, in this paper, as an extension of Reiter's default reasoning); and 4) truth-qualification uncertain default reasoning associated with fuzzy statements (developed in this paper to enrich fuzzy default reasoning with uncertain information). Our hybrid model has the following characteristics: 1) basic uncertainty is estimated in terms of words or phrases in natural language and basic propositions are fuzzy; 2) uncertainty, linguistically expressed, can be handled in default reasoning; and 3) the four kinds of reasoning models mentioned above and their combination models will be the special cases of our hybrid model. Moreover, our model allows the reasoning to be performed in the case in which the information is fuzzy, uncertain and partial. More importantly, the problems of sharing the information among heterogeneous fuzzy, uncertain and default reasoning models can be solved efficiently by using our model. Given this, our framework can be used as a basis for information sharing and exchange in knowledge-based multi-agent systems for practical applications such as automated group negotiations. Actually, to build such a foundation is the motivation of this paper.
Zhang, S. & Zhang, C. 2002, 'Encoding Probability Propagation in Belief Networks', IEEE Transactions on Systems, Man and Cybernetics (Part A), vol. 32, no. 4, pp. 526-531.
View/Download from: UTS OPUS or Publisher's site
Complexity reduction is an important task in Bayesian networks. Recently, an approach known as the linear potential function (LPF) model has been proposed for approximating Bayesian computations. The LPF model can effectively compress a conditional probability table into a linear function. This correspondence extends the LPF model to approximate propagation in Bayesian networks. The extension focuses on encoding probability propagation as a polynomial function for a class of tractable problems.
Zhang, S. & Zhang, C. 2002, 'Propagating Temporal Relations Of Intervals By Matrix', Applied Artificial Intelligence, vol. 16, pp. 1-27.
View/Download from: UTS OPUS or Publisher's site
Traditional temporal relation propagation is based on Allen's Interval Algebra. This paper proposes an alternative method to propagate temporal relations among intervals, in which 5 × 5 matrices are used to represent the temporal relations of intervals. Hence, the propagation of temporal relations is transformed into a numerical computation. For efficiency, we use the special values of the thirteen matrices to determine the possible temporal relations between two given intervals using only the final resultant matrix, so as to optimize the propagation. To evaluate the utility of the proposed technique, we have implemented the matrix representation in Java. The experimental results demonstrate that the approach is efficient and promising.
Zhang, S. & Zhang, C. 2001, 'A model for compressing probabilities in belief networks.', Informatica, vol. 25, pp. 409-419.
Probabilistic reasoning with belief (Bayesian) networks is based on conditional probability matrices, and thus suffers from NP-hard implementations. In particular, the amount of probabilistic information necessary for the computations is often overwhelming, so compressing the conditional probability table is one of the most important issues faced by the probabilistic reasoning community. Santos suggested an approach (called linear potential functions) for compressing the information from a combinatorial amount to roughly linear in the number of random variable assignments. However, much of the information in Bayesian networks in which there are no linear potential functions would be better fitted by polynomial approximating functions than forced into linear ones. For this reason, we construct a polynomial method to compress the conditional probability table in this paper. We evaluated the proposed technique, and our experimental results demonstrate that the approach is efficient and promising.
Zhang, Z. & Zhang, C. 1999, 'A Serving Agent For Integrating Soft Computing And Software Agents - Extended Summary', Advanced Topics In Artificial Intelligence, vol. 1747, pp. 476-477.
NA
Li, Y. & Zhang, C. 1999, 'Information-based Cooperation In Multiple Agent Systems - Extended Summary', Advanced Topics In Artificial Intelligence, vol. 1747, pp. 496-498.
NA
Wang, X., Yi, X., Lam, K., Zhang, C. & Okamoto, E. 1999, 'Secure Agent-mediated Auctionlike Negotiation Protocol For Internet Retail Commerce', Cooperative Information Agents Iii, Proceedings, vol. 1652, pp. 291-302.
NA
Zhang, M. & Zhang, C. 1999, 'Potential Cases, Methodologies, And Strategies Of Synthesis Of Solutions In Distributed Expert Systems', IEEE Transactions On Knowledge And Data Engineering, vol. 11, pp. 498-503.
NA
Luo, X. & Zhang, C. 1999, 'Proof Of The Correctness Of Emycin Sequential Propagation Under Conditional Independence Assumptions', IEEE Transactions On Knowledge And Data Engineering, vol. 11, pp. 355-359.
NA
Xu, Y.L. & Zhang, C. 1998, 'A Neural Network Diagnosis Model Without Disorder Independence Assumption', Pricai'98: Topics In Artificial Intelligence, vol. 1531, pp. 341-352.
View/Download from: UTS OPUS
Generally, the disorders in a neural network diagnosis model are assumed to be independent of each other. In this paper, we propose a neural network model for diagnostic problem solving in which the disorder independence assumption is no longer necessary. We first characterize the diagnostic tasks and the causal network used to represent the diagnostic problem, then describe the neural network diagnosis model, and finally give some experimental results.
Luo, X. & Zhang, C. 1998, 'LSlNCF: A Hybrid Uncertain Reasoning Model Based On Probability', International Journal Of Uncertainty Fuzziness And Knowledge-based Systems, vol. 6, pp. 401-422.
View/Download from: Publisher's site
NA
Zhang, M. & Zhang, C. 1997, 'Methodologies Of Solution Synthesis In Distributed Expert Systems', Multi-agent Systems, vol. 1286, pp. 137-151.
NA
Yang, H. & Zhang, C. 1997, 'Application Of MAS In Implementing Rational Ip Routers On The Priced Internet', Multi-agent Systems, vol. 1286, pp. 166-180.
Zhang, C. 1995, 'The Design And Implementation Of A Knowledge-based Communication-system In A Framework For Distributed Expert-systems', IEEE Transactions On Communications, vol. 43, pp. 1926-1936.
NA
Zhang, C. 1994, 'Heterogeneous Transformation Of Uncertainties Of Propositions Among Inexact Reasoning Models', IEEE Transactions On Knowledge And Data Engineering, vol. 6, pp. 353-360.
NA
Zhang, C. 1992, 'Cooperation Under Uncertainty In Distributed Expert Systems', Artificial Intelligence, vol. 56, pp. 21-69.
View/Download from: UTS OPUS or Publisher's site
NA