Longbing Cao was awarded a PhD in computing science by UTS and another PhD in pattern recognition and intelligent systems by the Chinese Academy of Sciences. He is a professor of information technology at the Faculty of Engineering and IT, UTS, and the founding Director of the UTS Advanced Analytics Institute. He is an ARC Future Fellow at Level 3 (professorial level) and the recipient of the Eureka Prize for Excellence in Data Science.
He has been promoting data science and analytics research, education and development since his time as a chief technology officer focused on business intelligence, before joining academia. In 2007, he established the Data Science and Knowledge Discovery Lab at UTS, the first Australian lab to carry "Data Science" in its name. In 2011, he founded the Advanced Analytics Institute at UTS, the first such group in Australia and the only one mentioned in several government papers. In the same year, he established the world-first research degrees Master of Analytics (Research) and PhD: Analytics. In 2012, he established the annual Big Data Summit, which provides a platform for bridging the gap between academia, industry and government. In 2013, he founded the IEEE Task Force on Data Science and Advanced Analytics (DSAA) and the IEEE Task Force on Behavioral, Economic and Socio-cultural Computing (BESC). In 2014, he established the IEEE International Conference on Data Science and Advanced Analytics and the ACM SIGKDD Australia and New Zealand Chapter. In 2015, he started the International Journal of Data Science and Analytics with Springer.
Over the past decades, his main efforts have focused on fundamental issues in data science (in particular complex data and behaviours, non-IID learning, and behaviour informatics), general machine learning and artificial intelligence issues, and better practice and actionable discovery for enterprise data science and analytics-driven decision support. For more than 15 years at UTS, he and his team have committed very early data science effort to real-life big data analytics problems for many large government and industry organizations inside and outside Australia, spanning capital markets, social welfare, taxation, immigration, financial services, telecommunications, banking, insurance, health and medical services, online business, social networks, recommendation, marketing, airlines, transport, and education.
Before joining UTS, Longbing had several years of research experience at the Chinese Academy of Sciences, and experience managing and leading industry and commercial projects in telecommunications, banking and publishing as a manager or chief technology officer. He was also the Research Leader of the Data Mining Program at the Australian Capital Markets Cooperative Research Centre.
He has been a Senior Member of the IEEE and its Computer and SMC Societies since 2006, and is a member of the ACM. He is the founding Chair of the ACM SIGKDD Australia and New Zealand Chapter, the IEEE Task Force on Data Science and Advanced Analytics, and the IEEE Task Force on Behavioral, Economic and Socio-cultural Computing. He has taken chairing and program committee roles in artificial intelligence, data mining and machine learning venues, e.g., as general co-chair of KDD2015, DSAA and PAKDD, and serves on steering committees such as those of DSAA and PAKDD.
He is the Editor-in-Chief of the International Journal of Data Science and Analytics (JDSA, Springer) and the Associate Editor-in-Chief of IEEE Intelligent Systems, in addition to serving on other editorial boards such as that of ACM Computing Surveys.
Can supervise: YES
Longbing's major research interest covers data science (data analytics and mining), machine learning, and artificial intelligence, in particular, the following areas:
- Data science and big data analytics: as one of the pioneering researchers, he initiates and leads research, education and development in data science. His interest in data mining and machine learning has mainly focused on complex data analytics, non-IID learning for big data analytics, domain-driven data mining/actionable knowledge discovery, and multi-structured data learning for complex data and environments, as well as infrastructure, solutions, systems, algorithms and services for enterprise data mining and business analytics applications.
- Behavior and social informatics: he proposed and has been leading research on behavior informatics and behavior computing, focusing on complex behavior and social modeling and representation, behavior and social network analysis, social media and sentiment analysis, group/community social behavior analysis, negative behavior analysis, behavior risk and impact modeling and analysis, high-risk, high-impact and high-utility behavior and social pattern analysis, behavior evolution, active behavior management, and domain-specific behavior and social analysis applications.
- Agent mining: a concept he proposed, involving fundamental infrastructure, agent-based distributed multi-source data mining, agent behavior learning, agent-based cloud analytics, and applications such as financial trading agents.
- Artificial intelligence and intelligent systems: including knowledge representation, software engineering and system design for open complex intelligent systems, metasynthetic computing and engineering, and learning systems.
In Data Science and Advanced Analytics, Longbing
- Was one of the very few who originally advocated the concept of “Data Science”, and founded and chairs the IEEE Task Force on Data Science and Advanced Analytics, the IEEE International Conference on Data Science and Advanced Analytics, and the ACM SIGKDD Australia and New Zealand Chapter;
- Formed the Data Science and Knowledge Discovery Lab at QCIS and then the Advanced Analytics Institute (AAi) at UTS, both dedicated to data science research, education and development, laying the foundation for UTS's data science research strength;
- Established the world-first Master of Analytics (Research) and PhD Thesis: Analytics degrees at UTS;
- Authored several position papers and monographs in data science in prestigious venues; and
- Provides high-impact consultancies to many tier-one industry and government organizations in Australia and globally.
Longbing is currently research-only. He has experience in teaching system development and user interface design, and in lecturing on data mining, behavior analytics, business intelligence and advanced analytics.
Cao, L, Lee, JG & Lin, X 2017, PC chairs' preface.
Provides a comprehensive overview and introduction to the concepts, methodologies, analysis, design and applications of metasynthetic computing and engineering. The author:
• Presents an overview of complex systems, especially open complex giant systems such as the Internet, complex behavioural and social problems, and actionable knowledge discovery and delivery in the big data era.
• Discusses ubiquitous intelligence in complex systems, including human intelligence, domain intelligence, social intelligence, network intelligence, data intelligence and machine intelligence, and their synergy through metasynthetic engineering.
• Explains the concept and methodology of human-centred, human-machine-cooperated qualitative-to-quantitative metasynthesis for understanding and managing open complex giant systems, and its computing approach: metasynthetic computing.
• Introduces techniques and tools for analysing and designing problem-solving systems for open complex problems and systems.
Metasynthetic Computing and Engineering uses the systematology methodology to address system complexities in open complex giant systems, for which neither reductionism nor holism alone may be effective. The book aims to encourage and inspire discussion, design, implementation and reflection on effective methodologies and tools for computing and engineering open complex systems and problems. Researchers, research students and practitioners in complex systems, artificial intelligence, data science, computer science, and even system science, cognitive science, behaviour science and social science, will find this book invaluable.
© Springer-Verlag London 2012. 'Behavior' is an increasingly important concept in the scientific, societal, economic, cultural, political, military, living and virtual worlds. Behavior computing, or behavior informatics, consists of methodologies, techniques and practical tools for examining and interpreting behaviors in these various worlds, and contributes to the in-depth understanding, discovery, application and management of behavior intelligence. With contributions from leading researchers in this emerging field, Behavior Computing: Modeling, Analysis, Mining and Decision includes chapters on: behavior representation and modeling; behavior ontology; behavior analysis; behavior pattern mining; clustering complex behaviors; classification of complex behaviors; behavior impact analysis; social behavior analysis; organizational behavior analysis; and behavior computing applications. The book provides a dedicated source of reference for the theory and applications of behavior informatics and behavior computing. Researchers, research students and practitioners in behavior studies, including the computer science, behavioral science and social science communities, will find this state-of-the-art volume invaluable.
Cao, L, Feng, Y & Zhong, J 2010, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface.View/Download from: Publisher's site
* Bridges the gap between business expectations and research output
* Includes techniques, methodologies and case studies in real-life enterprise data mining
* Addresses new areas such as blog mining
In the present thriving global economy, a need has evolved for complex data analysis to enhance an organization's production systems, decision-making tactics and performance. In turn, data mining has emerged as one of the most active areas in information technology. Domain Driven Data Mining offers state-of-the-art research and development outcomes on methodologies, techniques, approaches and successful applications in domain-driven, actionable knowledge discovery.
Data Mining and Multi-agent Integration presents cutting-edge research, applications and solutions in data mining, and the practical use of innovative information technologies, written by leading international researchers in the field. Topics examined include:
- Integration of multiagent applications and data mining
- Mining temporal patterns to improve agents' behavior
- Information enrichment through recommendation sharing
- Automatic web data extraction based on genetic algorithms and regular expressions
- A multiagent learning paradigm for a medical data mining diagnostic workbench
- A multiagent data mining framework
- Streaming data in complex uncertain environments
- Large data clustering
- A multiagent, multi-objective clustering algorithm
- An interactive web environment for psychometric diagnostics
- Anomaly detection on distributed firewalls using data mining techniques
- Automated reasoning for distributed and multiple sources of data
- Video content identification
Data Mining and Multi-agent Integration is intended for students, researchers, engineers and practitioners in the field interested in the synergy between agents and data mining. This book is also relevant for readers in related areas such as machine learning, artificial intelligence, intelligent systems, knowledge engineering, human-computer interaction, intelligent information processing, decision support systems, knowledge management, organizational computing, social computing, complex systems, and soft computing. © Springer Science+Business Media, LLC 2009. All rights reserved.
Data Mining for Business Applications presents state-of-the-art data mining research and development related to methodologies, techniques, approaches and successful applications. The contributions of this book mark a paradigm shift from "data-centered pattern mining" to "domain-driven actionable knowledge discovery (AKD)" for next-generation KDD research and applications. The contents identify how KDD techniques can better contribute to critical domain problems in practice and strengthen business intelligence in complex enterprise applications. The volume also explores challenges and directions for future data mining research and development in the dialogue between academia and business. Part I centers on developing workable AKD methodologies, including:
- domain-driven data mining
- post-processing rules for actions
- domain-driven customer analytics
- the role of human intelligence in AKD
- maximal pattern-based cluster
- ontology mining
Part II focuses on novel KDD domains and the corresponding techniques, exploring the mining of emergent areas and domains such as:
- social security data
- community security data
- gene sequences
- mental health information
- traditional Chinese medicine data
- cancer-related data
- blog data
- sentiment information
- web data
- procedures
- moving object trajectories
- land use mapping
- higher education data
- flight scheduling
- algorithmic asset management
Researchers, practitioners and university students in the areas of data mining and knowledge discovery, knowledge engineering, human-computer interaction, artificial intelligence, intelligent information processing, decision support systems, knowledge management, and KDD project management are sure to find this a practical and effective means of enhancing their understanding of and using data mining in their own projects. © 2009 Springer Science+Business Media, LLC. All rights reserved.
Cao, L & Ruwei, D 2008, Open Complex Intelligent Systems: Fundamentals, Concepts, Analysis, Design and Implementation, 1, Posts & Telecom Press, Beijing, China.
One of nine computer science books selected for the China Key Book Publishing Plan in the 11th Five-Year period (2006-2010).
Gorodetsky, V, Chengqi, Z, Skormin, V & Longbing, C 2007, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface.
Negative sequential patterns (NSPs), which capture both frequently occurring and nonoccurring behaviors, are becoming increasingly important and sometimes play a role that cannot be filled by analyzing occurring behaviors alone. Repetition sequential patterns capture repetitions of patterns across sequences as well as within a sequence, and are very important for understanding the repetition relations between behaviors. Though some methods are available for mining NSPs and repetition positive sequential patterns (RPSPs), we have not found any methods for mining repetition NSPs (RNSPs). RNSPs can help analysts further understand the repetition relationships between items and capture more comprehensive information with repetition properties. However, mining RNSPs is much more difficult than mining NSPs due to the intrinsic challenges of nonoccurring items. To address these issues, we first propose a formal definition of repetition negative containment. Then, we propose a method to convert repetition negative containment to repetition positive containment, which quickly calculates repetition supports using only the corresponding RPSPs' information, without rescanning the database. Finally, we propose an efficient algorithm, called e-RNSP, to mine RNSPs efficiently. To the best of our knowledge, e-RNSP is the first algorithm to efficiently mine RNSPs. Intensive experimental results on the first four real and synthetic datasets clearly show that e-RNSP can efficiently discover repetition negative patterns; results on the fifth dataset prove the effectiveness of the RNSPs captured by the proposed method; and results on the remaining 16 datasets analyze the impacts of data characteristics on the mining process.
Liu, Q, Do, TDT & Cao, L 2020, 'Answer Keyword Generation for Community Question Answering by Multi-aspect Gamma-Poisson Matrix Completion', IEEE Intelligent Systems.View/Download from: Publisher's site
© IEEE. Community question answering (CQA) recommends appropriate answers to existing and new questions. Such answer recommendation is challenging since CQA data is often sparse and decentralized and lacks sufficient information to generate suitable answers to existing questions. Matching answers to new questions is even more challenging in terms of modeling Q/A sparsity, generating answers to cold-start/novel questions, and integrating Q/A metadata into models. This paper addresses these issues with a novel statistical model that automatically generates answer keywords in CQA by multi-aspect Gamma-Poisson matrix completion (MAGIC). MAGIC is the first trial in CQA to model multiple aspects of Q/A sentence information by involving Q/A metadata, Q/A sparsity, and both lexical and semantic Q/A information in a hierarchical Gamma-Poisson model. MAGIC can efficiently generate answer keywords for both existing and new questions, outperforming nonnegative matrix factorization (MF), probabilistic MF, and relevant Poisson factorization models w.r.t. recommending appropriate and informative answer keywords.
© IEEE. Revealing complex relations between entities (e.g., items within or between transactions) is of great significance for business optimization, prediction, and decision making. Such relations include not only co-occurrence-based explicit relations but also non-co-occurrence-based implicit ones. Explicit relations have been substantially studied by rule mining-based approaches, including association rule mining and causal rule discovery. In contrast, implicit relations have received much less attention but could be more actionable. In this paper, we focus on implicit relations between items which rarely or never co-occur while each of them co-occurs with the same other items (link items) with high probability. A framework integrating both explicit and hidden item dependencies, and a corresponding efficient algorithm, IRRMiner, capture such implicit relations via implicit rule inference. Experimental results show that IRRMiner not only infers implicit rules of various sizes consisting of both frequent and infrequent items effectively, but also runs at least four times faster than IARMiner, a typical indirect association rule mining algorithm which can only mine size-2 indirect association rules between frequent items. IRRMiner is applied to make recommendations, showing that the identified implicit rules can increase recommendation reliability.
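The link-item intuition behind such implicit relations can be illustrated with a minimal sketch (the function name, thresholds and data are illustrative assumptions, not IRRMiner itself): two items qualify as an implicit pair when they rarely co-occur directly, yet each co-occurs with a shared link item with high estimated probability.

```python
from collections import Counter
from itertools import combinations

def implicit_pairs(transactions, min_link_conf=0.6, max_cooccur=1):
    """Flag item pairs that rarely or never co-occur directly, while each
    co-occurs with a shared 'link' item with high estimated probability."""
    item_count, pair_count = Counter(), Counter()
    for t in transactions:
        s = set(t)
        item_count.update(s)
        pair_count.update(frozenset(p) for p in combinations(sorted(s), 2))

    def conf(a, b):
        # P(b | a) estimated from co-occurrence counts
        return pair_count[frozenset((a, b))] / item_count[a]

    items = sorted(item_count)
    results = []
    for a, b in combinations(items, 2):
        if pair_count[frozenset((a, b))] > max_cooccur:
            continue  # a and b co-occur too often to count as "implicit"
        # A link item c must attach strongly to both a and b.
        links = [c for c in items if c not in (a, b)
                 and conf(a, c) >= min_link_conf and conf(b, c) >= min_link_conf]
        if links:
            results.append((a, b, links))
    return results
```

On a toy database where "a" and "b" never appear together but both always appear with "c", the pair ("a", "b") is flagged with link item "c".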
Wang, S, Cao, L, Hu, L, Berkovsky, S, Huang, X, Xiao, L & Lu, W 2020, 'Jointly Modeling Intra- and Inter-transaction Dependencies with Hierarchical Attentive Transaction Embeddings for Next-item Recommendation', IEEE Intelligent Systems.View/Download from: Publisher's site
© IEEE. A transaction-based recommender system (TBRS) attempts to predict the next item by modeling dependencies in transactional data. Generally, two kinds of dependencies are considered: intra-transaction dependency and inter-transaction dependency. Most existing TBRSs recommend the next item by modeling only the intra-transaction dependency within the current transaction, ignoring inter-transaction dependency with recent transactions that may also affect the next item. However, not all recent transactions are relevant to the current one and the next item, so the relevant ones should be prioritized. In this paper, we propose a novel hierarchical attentive transaction embedding (HATE) model to tackle these issues. Specifically, a two-level attention mechanism integrates both item embeddings and transaction embeddings to build an attentive context representation incorporating both intra- and inter-transaction dependencies and to recommend the next item. Experimental evaluations on two real-world datasets of shopping transactions show that HATE significantly outperforms state-of-the-art methods in terms of recommendation accuracy.
Zhu, C, Cao, L & Yin, J 2020, 'Unsupervised Heterogeneous Coupling Learning for Categorical Representation', IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1.View/Download from: Publisher's site
Dong, X, Qiu, P, Lu, J, Cao, L & Xu, T 2019, 'Mining Top-k Useful Negative Sequential Patterns via Learning', IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 9, pp. 2764-2778.View/Download from: Publisher's site
As an important tool for behavior informatics, negative sequential patterns (NSPs) (such as missing a medical treatment) are sometimes much more informative than positive sequential patterns (PSPs) (e.g., attending a medical treatment) in many applications. However, NSP mining is at an early stage and faces many challenging problems, including 1) how to mine an expected number of NSPs; 2) how to select useful NSPs; and 3) how to reduce high time consumption. To solve the first problem, we propose an algorithm Topk-NSP to mine the k most frequent negative patterns. In Topk-NSP, we first mine the top-k PSPs using the existing methods, and then we use an idea which is similar to top-k PSP mining to mine the top-k NSPs from these PSPs. To solve the remaining two problems, we propose three optimization strategies for Topk-NSP. The first optimization strategy is that, in order to consider the influence of PSPs when selecting useful top-k NSPs, we introduce two weights, wP and wN, to express the user preference degree for NSPs and PSPs, respectively, and select useful NSPs by a weighted support wsup. The second optimization strategy is to merge wsup and an interestingness metric to select more useful NSPs. The third optimization strategy is to introduce a pruning strategy to reduce the high computational costs of Topk-NSP. Finally, we propose an optimization algorithm Topk-NSP+. To the best of our knowledge, Topk-NSP+ is the first algorithm that can mine the top-k useful NSPs. The experimental results on four synthetic and two real-life data sets show that Topk-NSP+ is very efficient in mining the top-k NSPs in terms of computational cost and scalability.
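One plausible reading of the weighted-support idea can be sketched as follows (the exact formula, the weight values, and the pattern strings below are assumptions for illustration, not taken from the paper): an NSP's usefulness blends its own support with the support of the PSP it was derived from, weighted by user preference.

```python
def weighted_support(sup_nsp, sup_psp, w_n=0.7, w_p=0.3):
    """Hypothetical weighted support wsup combining an NSP's own support
    with the support of the PSP it was derived from; w_n and w_p are
    assumed user-preference weights summing to 1."""
    assert abs(w_n + w_p - 1.0) < 1e-9
    return w_n * sup_nsp + w_p * sup_psp

# Rank candidate NSPs by weighted support (illustrative data:
# each tuple is (pattern, NSP support, parent-PSP support)).
candidates = [("<a !b c>", 0.12, 0.30), ("<!a c>", 0.20, 0.25)]
ranked = sorted(candidates,
                key=lambda t: weighted_support(t[1], t[2]),
                reverse=True)
```

Here "<!a c>" ranks first (0.7 * 0.20 + 0.3 * 0.25 = 0.215 versus 0.174), showing how the blend can reorder candidates relative to raw NSP support alone.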
Guo, B, Ouyang, Y, Guo, T, Cao, L & Yu, Z 2019, 'Enhancing Mobile App User Understanding and Marketing with Heterogeneous Crowdsourced Data: A Review', IEEE Access, vol. 7, pp. 68557-68571.View/Download from: Publisher's site
The mobile app market has been surging in recent years and has key characteristics that differentiate it from traditional markets. To enhance mobile app development and marketing, it is important to study key research challenges such as app user profiling, usage pattern understanding, popularity prediction, and requirement and feedback mining. This paper reviews CrowdApp, a research field that leverages heterogeneous crowdsourced data for mobile app user understanding and marketing. We first characterize the opportunities of CrowdApp, and then present the key research challenges and state-of-the-art techniques to deal with these challenges. We further discuss the open issues and future trends of CrowdApp. Finally, an evolvable app ecosystem architecture based on heterogeneous crowdsourced data is presented.
Hao, S, Shi, C, Niu, Z & Cao, L 2019, 'Modeling positive and negative feedback for improving document retrieval', Expert Systems with Applications, vol. 120, pp. 253-261.View/Download from: Publisher's site
© 2018 Elsevier Ltd. Pseudo-relevance feedback (PRF) has evident potential for enriching the representation of short queries. Traditional PRF methods treat top-ranked documents as feedback, since they are assumed to be relevant to the query. However, some of these feedback documents may actually distract from the query topic for a range of reasons and accordingly downgrade PRF system performance. Such documents constitute negative examples (negative feedback) but could also be valuable in retrieval. In this paper, a novel framework of query language model construction is proposed in order to improve retrieval performance by integrating both positive and negative feedback. First, an improvement-based method is proposed to automatically identify the type of each feedback document (i.e., positive or negative) according to whether the document enhances retrieval effectiveness. Subsequently, based on the learned positive and negative examples, the positive feedback models and the negative feedback models are estimated using an Expectation-Maximization algorithm, under the assumptions that the positive term distribution is affected by the context term distribution, and that the negative term distribution is affected by both the positive and the context term distributions (such that the positive feedback model upgrades the rankings of relevant documents and the negative feedback model prunes irrelevant documents from a query). Finally, a content-based representativeness criterion is proposed in order to obtain representative negative feedback documents. Experiments conducted on the TREC collections demonstrate that our proposed approach results in better retrieval accuracy and robustness than baseline methods.
Hu, L, Chen, Q, Cao, L, Jian, S, Zhao, H & Cao, J 2019, 'Evolving Coauthorship Modeling and Prediction via Time-Aware Paired Choice Analysis', IEEE Access, vol. 7, pp. 98639-98651.View/Download from: Publisher's site
Coauthorship prediction is challenging yet important for academic collaboration and novel research topics discovery. The challenges lie in the dynamics of social or organizational relationships, changing preferences of suitable collaborators, and the evolution of research interests or topics. However, most current approaches and systems developed so far are mainly based on past coauthorships from a static viewpoint and do not capture the above evolving characteristics in coauthoring. Accordingly, this paper proposes a time-aware approach to capture the evolving coauthorships from online academic databases in terms of capturing the dynamics of social relationships and research interests. In particular, in order to understand the underlying factors influencing researchers to make choices of coauthors, we incorporate choice modeling based on utility theory. More specifically, our model conducts a series of pairwise choices over a poset induced by a utility function so as to learn the preference over all candidate coauthors. To complete the model inference, a gradient-based algorithm is devised to efficiently learn the model parameters for large-scale data. Finally, extensive experiments conducted on a real-world dataset show that our approach consistently outperforms other state-of-the-art methods.
Jian, S, Pang, G, Cao, L, Lu, K & Gao, H 2019, 'CURE: Flexible Categorical Data Representation by Hierarchical Coupling Learning', IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 8, pp. 1-14.View/Download from: Publisher's site
© IEEE. The representation of categorical data with hierarchical coupling relationships (i.e., value to value cluster interactions) is very critical yet challenging for capturing data characteristics in learning tasks. This paper proposes a novel and flexible coupled unsupervised categorical data representation (CURE) framework which not only captures the hierarchical couplings but also is flexible to be instantiated for contrastive learning tasks. Based on two complementary value coupling functions, CURE is instantiated into two instances: the coupled data embedding (CDE) for clustering and the coupled outlier scoring of high-dimensional data (COSH) for outlier detection, by customizing the ways of value clustering and coupling learning between value clusters. CDE embeds categorical data into a new space in which features are independent and semantics are rich. COSH represents data with an outlying vector to capture complex outlying behaviors of objects in high-dimensional data. Substantial experiments show that CDE significantly outperforms three popular unsupervised embedding methods and three state-of-the-art similarity-based representation methods, and COSH performs significantly better than five state-of-the-art outlier detection methods on high-dimensional data sets. CDE and COSH are scalable and stable, linear to data size and quadratic to the number of features, and are insensitive to their parameters.
Zhang, Q, Shi, C, Niu, Z & Cao, L 2019, 'HCBC: A Hierarchical Case-Based Classifier Integrated with Conceptual Clustering', IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 1, pp. 152-165.View/Download from: Publisher's site
© 1989-2012 IEEE. The structured case representation improves case-based reasoning (CBR) by exploring structures in the case base and the relevance of case structures. Recent CBR classifiers have mostly been built upon the attribute-value case representation rather than structured case representation, in which the structural relations embodied in their representation structure are accordingly overlooked in improving the similarity measure. This results in retrieval inefficiency and limitations on the performance of CBR classifiers. This paper proposes a hierarchical case-based classifier, HCBC, which introduces a concept lattice to hierarchically organize cases. By exploiting structural case relations in the concept lattice, a novel dynamic weighting model is proposed to enhance the concept similarity measure. Based on this similarity measure, HCBC retrieves the top-K concepts that are most similar to a new case by using a bottom-up pruning-based recursive retrieval (PRR) algorithm. The concepts extracted in this way are applied to suggest a class label for the case by a weighted majority voting. Experimental results show that HCBC outperforms other classifiers in terms of classification performance and robustness on categorical data, and also works confidently well on numeric datasets. In addition, PRR effectively reduces the search space and greatly improves the retrieval efficiency of HCBC.
Deng, Z, Chen, J, Zhang, T, Cao, L & Wang, S 2018, 'Generalized Hidden-Mapping Minimax Probability Machine for the training and reliability learning of several classical intelligent models', Information Sciences, vol. 436-437, pp. 302-319.View/Download from: Publisher's site
© 2018 Elsevier Inc. Minimax Probability Machine (MPM) is a binary classifier that optimizes the upper bound of the misclassification probability. This upper bound of the misclassification probability can be used as an explicit indicator to characterize the reliability of the classification model and thus makes the classification model more transparent. However, the existing related work is constrained to linear models or the corresponding nonlinear models by applying the kernel trick. To relax such constraints, we propose the Generalized Hidden-Mapping Minimax Probability Machine (GHM-MPM). GHM-MPM is a generalized MPM. It is capable of training many classical intelligent models, such as feedforward neural networks, fuzzy logic systems, and linear and kernelized linear models for classification tasks, and realizing the reliability learning of these models simultaneously. Since the GHM-MPM, similarly to the classical MPM, was originally developed only for binary classification, it is further extended to multi-class classification by using the obtained reliability indices of the binary classifiers of two arbitrary classes. The experimental results show that GHM-MPM makes the trained models more transparent and reliable than those trained by classical methods.
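As background, the classical MPM objective that GHM-MPM generalizes can be recalled (following Lanckriet et al.'s formulation; the notation here is assumed, not drawn from this paper). For class means $\mu_\pm$ and covariances $\Sigma_\pm$, MPM seeks a hyperplane $w^\top x = b$ maximizing the worst-case accuracy $\alpha$:

```latex
\max_{\alpha,\, w \neq 0,\, b} \; \alpha
\quad \text{s.t.} \quad
\inf_{x \sim (\mu_+,\, \Sigma_+)} \Pr\{w^\top x \ge b\} \ge \alpha,
\qquad
\inf_{x \sim (\mu_-,\, \Sigma_-)} \Pr\{w^\top x \le b\} \ge \alpha.
```

Via a Chebyshev-type bound this reduces to minimizing $\|\Sigma_+^{1/2} w\| + \|\Sigma_-^{1/2} w\|$ subject to $w^\top(\mu_+ - \mu_-) = 1$, and the resulting worst-case misclassification probability is upper-bounded by $1 - \alpha^* = 1/(1 + \kappa^{*2})$ with $\kappa^* = \sqrt{\alpha^*/(1 - \alpha^*)}$; this upper bound is the explicit reliability indicator the abstract refers to.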
Dong, X, Gong, Y & Cao, L 2018, 'F-NSP+: A fast negative sequential patterns mining method with self-adaptive data storage', Pattern Recognition, vol. 84, pp. 13-27.View/Download from: Publisher's site
© 2018 Elsevier Ltd. Mining negative sequential patterns (NSPs) is an important tool for nonoccurring behavior analysis, and it is much more challenging than mining positive sequential patterns (PSPs) due to the high computational complexity and huge search space when obtaining the support of negative sequential candidates (NSCs). Very few NSP mining algorithms are available and most of them are very inefficient, since they obtain the support of NSCs by scanning the database repeatedly. Instead, the state-of-the-art NSP mining algorithm e-NSP only uses the PSPs' information stored in an array structure to 'calculate' the support of NSCs by equations, without database re-scanning. This makes e-NSP highly efficient, particularly on sparse datasets. However, when datasets become dense, the key process of obtaining the support of NSCs in e-NSP becomes very time-consuming and needs to be improved. In this paper, we propose a novel and efficient data structure, a bitmap, to obtain the support of NSCs. We correspondingly propose a fast NSP mining algorithm, f-NSP, which uses a bitmap to store the PSPs' information and then obtains the support of NSCs only by bitwise operations, which is much faster than the hash method in e-NSP. Experimental results on real-world and synthetic datasets show that f-NSP is not only tens to hundreds of times faster than e-NSP, but also saves more than ten-fold the storage space of e-NSP, particularly on dense datasets with a large number of elements in a sequence or a small number of itemsets. Further, we find that f-NSP consumes more storage space than e-NSP when a PSP's support is less than a support threshold sdsup, a value obtained through our theoretical analysis of storage space. Accordingly, we propose a self-adaptive storage strategy and a corresponding algorithm f-NSP+ to overcome this deficiency. f-NSP+ can automatically choose a bitmap or an array structure to store PSP information according to PSP support. Experimental results show that f-NSP...
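The bitwise support calculation can be illustrated with a minimal sketch (toy data and simplified structures, not f-NSP's actual implementation): each pattern's set of containing sequences is packed into an integer bitmap, so support is a popcount and set intersections/differences become single bitwise operations, with no database rescanning.

```python
def bitmap(seq_ids):
    """Encode the set of sequence ids containing a pattern as an integer bitmap."""
    bm = 0
    for i in seq_ids:
        bm |= 1 << i
    return bm

def support(bm):
    """Support = number of set bits, i.e. sequences containing the pattern."""
    return bin(bm).count("1")

# Toy database of 8 sequences: PSP p occurs in sequences {0, 2, 3, 5},
# PSP q in {2, 3, 6}; derived supports come from bitwise ops alone.
bm_p = bitmap({0, 2, 3, 5})
bm_q = bitmap({2, 3, 6})

sup_p = support(bm_p)                # 4
sup_p_and_q = support(bm_p & bm_q)   # sequences with both p and q: {2, 3}
sup_p_not_q = support(bm_p & ~bm_q)  # sequences with p but not q: {0, 5}
```

The AND/AND-NOT combinations stand in for the set operations an e-NSP-style support equation needs; on dense data, machine-word bitwise operations are what make this much faster than hashing.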
Hao, S, Shi, C, Niu, Z & Cao, L 2018, 'Concept coupling learning for improving concept lattice-based document retrieval', Engineering Applications of Artificial Intelligence, vol. 69, pp. 65-75.
The semantic information in any document collection is critical for query understanding in information retrieval. Existing concept lattice-based retrieval systems mainly rely on the partial order relation of formal concepts to index documents. However, the methods used by these systems often ignore the explicit semantic information between the formal concepts extracted from the collection. In this paper, a concept coupling relationship analysis model is proposed to learn and aggregate the intra- and inter-concept coupling relationships. The intra-concept coupling relationship employs the common terms of formal concepts to describe the explicit semantics of formal concepts. The inter-concept coupling relationship adopts the partial order relation of formal concepts to capture the implicit dependency of formal concepts. Based on the concept coupling relationship analysis model, we propose a concept lattice-based retrieval framework. This framework represents user queries and documents in a concept space based on fuzzy formal concept analysis, utilizes a concept lattice as a semantic index to organize documents, and ranks documents with respect to the learned concept coupling relationships. Experiments are performed on the text collections acquired from the SMART information retrieval system. Compared with classic concept lattice-based retrieval methods, our proposed method achieves at least 9%, 8% and 15% improvement in terms of average MAP, IAP@11 and P@10 respectively on all the collections.
Hu, L, Chen, Q, Zhao, H, Jian, S, Cao, L & Cao, J 2018, 'Neural Cross-Session Filtering: Next-Item Prediction Under Intra- and Inter-Session Context', IEEE Intelligent Systems, vol. 33, no. 6, pp. 57-67.
Classic recommender systems (RSs) often repeatedly recommend items similar to a user's historical profile or recent purchases. To address this, session-based RSs (SBRSs) have been extensively studied in recent years. Current SBRSs often assume a rigid-order sequence, which does not fit many real-world cases. In fact, next-item recommendation depends not only on the current session context but also on historical sessions, which are often neglected by current SBRSs. Accordingly, an SBRS over relaxed-order sequences with both intra- and inter-session context is more pragmatic. Inspired by successful experience in modern language modeling, we design an efficient neural architecture that models both intra- and inter-session context for next-item prediction.
Huang, Y, Cao, L, Zhang, J, Pan, L & Liu, Y 2018, 'Exploring feature coupling and model coupling for image source identification', IEEE Transactions on Information Forensics and Security, vol. 13, no. 12, pp. 3108-3121.
Recently, there has been great interest in feature-based image source identification. Previous statistical learning-based methods usually regarded the identification process as a classification problem, assuming the independence of features and the independence of models. However, these two assumptions are usually problematic because of the genuine coupling of features and models. To address these issues, in this paper we propose a novel image source identification scheme. For feature coupling, a coupled feature representation is adopted to analyze the coupled interaction among features: the coupling relations among features and their powers are measured with Pearson's correlations and integrated in a Taylor-like expansion manner. For model coupling, a new coupled probability representation is developed: the model coupling relationships are characterized with conditional probabilities induced by the confusion matrix and then combined via the law of total probability. Experiments carried out on the Dresden image collection confirm the effectiveness of the proposed scheme. By mining the feature coupling and model coupling, the identification accuracy can be significantly improved.
Jian, S, Cao, L, Lu, K & Gao, H 2018, 'Unsupervised Coupled Metric Similarity for Non-IID Categorical Data', IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 9, pp. 1810-1823.
Appropriate similarity measures always play a critical role in data analytics, learning, and processing. Measuring the intrinsic similarity of categorical data for unsupervised learning has not been substantially addressed, and even less effort has been made on similarity analysis for categorical data that is not independent and identically distributed (non-IID). In this work, a Coupled Metric Similarity (CMS) is defined for unsupervised learning which flexibly captures the value-to-attribute-to-object heterogeneous coupling relationships. CMS learns similarities in terms of intrinsic heterogeneous intra- and inter-attribute couplings and attribute-to-object couplings in categorical data. CMS validity is guaranteed by satisfying metric properties and conditions, and CMS can flexibly adapt to both IID and non-IID data. CMS is incorporated into spectral clustering and k-modes clustering and compared with relevant state-of-the-art similarity measures that are not necessarily metrics. The experimental results and theoretical analysis show the effectiveness of CMS in capturing independent and coupled data characteristics; it significantly outperforms other similarity measures on most datasets.
Lian, D, Zheng, K, Ge, Y, Cao, L, Chen, E & Xie, X 2018, 'GeoMF++: Scalable location recommendation via joint geographical modeling and matrix factorization', ACM Transactions on Information Systems, vol. 36, no. 3.
Location recommendation is an important means to help people discover attractive locations. However, the extreme sparsity of user-location matrices poses a severe challenge, so it is necessary to take the implicit-feedback characteristics of user mobility data into account and to leverage locations' spatial information. To this end, based on the previously developed GeoMF, we propose a scalable and flexible framework, dubbed GeoMF++, for joint geographical modeling and implicit feedback-based matrix factorization. We then develop an efficient optimization algorithm for parameter learning, which scales linearly with data size and the total number of neighbor grids of all locations. GeoMF++ can be explained from two perspectives. First, it subsumes two-dimensional kernel density estimation, so it captures the spatial clustering phenomenon in user mobility data. Second, it is strongly connected with widely used neighbor additive models, graph Laplacian regularized models, and collective matrix factorization. Finally, we extensively evaluate GeoMF++ on two large-scale LBSN datasets. The experimental results show that GeoMF++ consistently outperforms the state-of-the-art and other competing baselines on both datasets in terms of NDCG and Recall. Besides, efficiency studies show that GeoMF++ is much more scalable with increases in data size and the dimension of the latent space.
Attributed networks consist of not only a network structure but also node attributes. Most existing community detection algorithms only focus on network structures and ignore node attributes, which are also important. Although some algorithms using both node attributes and network structure information have been proposed in recent years, the complex hierarchical coupling relationships within and between attributes, nodes and network structure have not been considered. Such hierarchical couplings are driving factors in community formation. This paper introduces a novel coupled node similarity (CNS) to involve and learn attribute and structure couplings and compute the similarity within and between nodes with categorical attributes in a network. CNS learns and integrates the frequency-based intra-attribute coupled similarity within an attribute, the co-occurrence-based inter-attribute coupled similarity between attributes, and coupled attribute-to-structure similarity based on the homophily property. CNS is then used to generate the weights of edges and transfer a plain graph to a weighted graph. Clustering algorithms detect community structures that are topologically well-connected and semantically coherent on the weighted graphs. Extensive experiments verify the effectiveness of CNS-based community detection algorithms on several data sets by comparing with the state-of-the-art node similarity measures, whether they involve node attribute information and hierarchical interactions, and on various levels of network structure complexity.
Wang, C, Chi, C-H, She, Z, Cao, L & Stantic, B 2018, 'Coupled Clustering Ensemble by Exploring Data Interdependence', ACM Transactions on Knowledge Discovery from Data, vol. 12, no. 6.
Wang, L, Bao, X, Chen, H & Cao, L 2018, 'Effective lossless condensed representation and discovery of spatial co-location patterns', Information Sciences, vol. 436-437, pp. 197-213.
A spatial co-location pattern is a set of spatial features frequently co-occurring in nearby geographic spaces. Similar to closed frequent itemset mining, closed co-location pattern (CCP) mining was proposed for losslessly condensing large collections of prevalent co-location patterns. However, the state-of-the-art condensation methods in CCP mining are inspired by closed frequent itemset mining and do not consider the intrinsic characteristics of spatial co-locations, e.g., the participation index and ratio in spatial feature interactions, thus causing serious containment issues in CCP mining. In this paper, we propose a novel lossless condensed representation of prevalent co-location patterns, the Super Participation Index-closed (SPI-closed) co-location. An efficient SPI-closed Miner is also proposed to effectively capture the nature of spatial co-location patterns, alongside three additional pruning strategies that make the SPI-closed Miner efficient. This method captures richer feature interactions in spatial co-locations and solves the containment issues of existing CCP methods. A performance evaluation conducted on both synthetic and real-life data sets shows that the SPI-closed Miner reduces the number of CCPs by up to 50%, and runs much faster than the baseline CCP mining algorithm described in the literature.
Zhu, C, Cao, L, Liu, Q, Yin, J & Kumar, V 2018, 'Heterogeneous Metric Learning of Categorical Data with Hierarchical Couplings', IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 7, pp. 1254-1267.
The 21st century has ushered in the age of big data and data economy, in which data DNA, which carries important knowledge, insights, and potential, has become an intrinsic constituent of all data-based organisms. An appropriate understanding of data DNA and its organisms relies on the new field of data science and its keystone, analytics. Although it is widely debated whether big data is only hype and buzz, and data science is still in a very early phase, significant challenges and opportunities are emerging or have been inspired by the research, innovation, business, profession, and education of data science. This article provides a comprehensive survey and tutorial of the fundamental aspects of data science: the evolution from data analysis to data science, the data science concepts, a big picture of the era of data science, the major challenges and directions in data innovation, the nature of data analytics, new industrialization and service opportunities in the data economy, the profession and competency of data education, and the future of data science. This article is the first in the field to draw a comprehensive big picture, in addition to offering rich observations, lessons, and thinking about data science and analytics.
While data science has emerged as an ambitious new scientific field, related debates and discussions have sought to address why science in general needs data science and what even makes data science a science. However, few such discussions concern the intrinsic complexities and intelligence in data science.
Ghosh, S, Li, J, Cao, L & Ramamohanarao, K 2017, 'Septic shock prediction for ICU patients via coupled HMM walking on sequential contrast patterns.', Journal of Biomedical Informatics, vol. 66, pp. 19-31.
BACKGROUND AND OBJECTIVE: Critical care patient events like sepsis or septic shock in intensive care units (ICUs) are dangerous complications which can cause multiple organ failures and eventual death. Preventive prediction of such events will allow clinicians to stage effective interventions for averting these critical complications. METHODS: It is widely understood that physiological conditions of patients on variables such as blood pressure and heart rate are suggestive of gradual changes over a certain period of time, prior to the occurrence of a septic shock. This work investigates the performance of a novel machine learning approach for the early prediction of septic shock. The approach combines highly informative sequential patterns extracted from multiple physiological variables and captures the interactions among these patterns via coupled hidden Markov models (CHMMs). In particular, the patterns are extracted from three non-invasive waveform measurements: the mean arterial pressure levels, the heart rates and the respiratory rates of septic shock patients from a large clinical ICU dataset called MIMIC-II. EVALUATION AND RESULTS: For baseline estimations, SVM and HMM models on the continuous time series data for the given patients, using MAP (mean arterial pressure), HR (heart rate), and RR (respiratory rate), are employed. Single-channel patterns based HMM (SCP-HMM) and multi-channel patterns based coupled HMM (MCP-HMM) are compared against baseline models using 5-fold cross-validation accuracies over multiple rounds. In particular, the results of MCP-HMM are statistically significant, with a p-value of 0.0014, in comparison to baseline models. Our experiments demonstrate a strong competitive accuracy in the prediction of septic shock, especially when the interactions between the multiple variables are coupled by the learning model. CONCLUSIONS: It can be concluded that the novelty of the approach stems from the integration of sequence-based physiological pa...
Jiang, Y, Tsai, P, Yeh, WC & Cao, L 2017, 'A honey-bee-mating based algorithm for multilevel image segmentation using Bayesian theorem', Applied Soft Computing Journal, vol. 52, pp. 1181-1190.
Image thresholding techniques are considered essential for object segmentation, compression and target recognition, and they have been widely studied over the last few decades. Multi-level thresholding methods, however, pose great challenges for image segmentation because they remain computationally expensive as the number of thresholds increases. We therefore propose an algorithm based on Bayesian theorem and the so-called honey-bee-mating algorithm (HBMA), called the Bayesian honey-bee-mating algorithm (BHBMA). It not only reduces computational time and mitigates the curse of dimensionality, but also runs more reliably and stably. This enhanced capability is technically accomplished by embedding a new population initialization strategy based on the characteristics of multi-level thresholding in pixel-based intensity images, arranged from lower grey levels to higher ones. Extensive experiments have shown that our proposed method empirically outperforms other state-of-the-art algorithms in terms of effectiveness and efficiency when applied to complex image processing scenarios such as automatic target recognition.
Traditional SVM-based multi-class classification algorithms mainly adopt the strategy of mapping the data set with all classes into a single feature space via a kernel function, in which SVM is constructed for each decomposed binary classification problem. However, it is not always possible to find an appropriate kernel function to render all the classes distinguishable in a single feature space, since each class is always derived from different data distributions. Consequently, the performance is not always as good as expected. To improve the performance of multi-class classification, this paper proposes an improved approach, called multi-state-mapping (MSM) with SVM based on hierarchical architecture, which maps the data set with all classes into different feature spaces at the different states of the decomposition of a multi-class classification problem in terms of a binary tree architecture. We prove that the computational complexity of MSM at its worst lies between that of the one-against-all scheme and the one-against-one scheme. Substantial experiments have been conducted on sixteen UCI data sets to show the performance of our method. The statistical results show that MSM outperforms state-of-the-art methods in terms of accuracy and standard deviation.
Meng, X, Cao, L, Zhang, X & Shao, J 2017, 'Top-k coupled keyword recommendation for relational keyword queries', Knowledge and Information Systems, vol. 50, no. 3, pp. 883-916.
Providing top-k typical relevant keyword queries would benefit users who cannot formulate appropriate queries to express their imprecise query intentions. By extracting the semantic relationships both between keywords and between keyword queries, this paper proposes a new keyword query suggestion approach that can provide typical and semantically related queries for a given query. Firstly, a keyword coupling relationship measure, which considers both intra- and inter-couplings between each pair of keywords, is proposed. Then, the semantic similarity of different keyword queries can be measured using a semantic matrix, in which the coupling relationships between keywords in queries are preserved. Based on the query semantic similarities, we next propose an approximation algorithm to find the most typical queries from the query history using probability density estimation. Lastly, a threshold-based top-k query selection method is proposed to expeditiously evaluate the top-k typical relevant queries. We demonstrate that our keyword coupling relationship and query semantic similarity measures can accurately capture the coupling relationships between keywords and the semantic similarities between keyword queries. The efficiency of the query typicality analysis and top-k query selection algorithm is also demonstrated.
Wang, Z & Cao, L 2017, 'Coupled Attribute Similarity Learning on Categorical Data for Multi-Label Classification', Journal of Beijing Institute of Technology (English Edition), vol. 26, no. 3, pp. 404-410.
In this paper, a novel coupled attribute similarity learning method based on multi-label categorical data (CASonMLCD) is proposed. The CASonMLCD method not only computes the correlations between different attributes and multi-label sets using information gain, which can be regarded as the importance degree of each attribute in the attribute learning method, but also further analyzes the intra-coupled and inter-coupled interactions between attribute value pairs for different attributes and multiple labels. The paper compares CASonMLCD with the OF distance and Jaccard similarity, based on the MLKNN algorithm, according to five common evaluation criteria. The experimental results demonstrate that CASonMLCD can mine similarity relationships more accurately and comprehensively, and that it obtains better performance than the compared methods.
Wang, Z & Cao, L 2017, 'Novel Apriori-Based Multi-Label Learning Algorithm by Exploiting Coupled Label Relationship', Journal of Beijing Institute of Technology (English Edition), vol. 26, no. 2, pp. 206-214.
It is a key challenge to exploit the label coupling relationship in multi-label classification (MLC) problems. Most previous work has focused on pairwise label relations, in which generally only global statistical information is used to analyze the coupled label relationship. In this work, Bayesian and hypothesis-testing methods are first applied to predict the label set size of testing samples within their k nearest neighbor samples, which combines global and local statistical information; the Apriori algorithm is then used to mine the label coupling relationship among multiple labels rather than pairwise labels, which exploits the label coupling relations more accurately and comprehensively. The experimental results on text, biology and audio datasets show that, compared with the state-of-the-art algorithm, the proposed algorithm obtains better performance on five common criteria.
Zhai, T, Gao, Y, Wang, H & Cao, L 2017, 'Classification of high-dimensional evolving data streams via a resource-efficient online ensemble', Data Mining and Knowledge Discovery, vol. 31, no. 5, pp. 1242-1265.
A novel online ensemble strategy, ensemble BPegasos (EBPegasos), is proposed to simultaneously solve the problems caused by concept drift and the curse of dimensionality in classifying high-dimensional evolving data streams, which has not been addressed in the literature. First, EBPegasos uses BPegasos, an online kernelized SVM-based algorithm, as the component classifier to address the scalability and sparsity of high-dimensional data. Second, EBPegasos takes full advantage of the characteristics of BPegasos to cope with various types of concept drift. Specifically, EBPegasos constructs diverse component classifiers by controlling the budget size of BPegasos; it also equips each component with a drift detector to monitor and evaluate its performance, and modifies the ensemble structure only when large performance degradation occurs. This conditional structural modification strategy allows EBPegasos to strike a good balance between exploiting and forgetting old knowledge. Lastly, we show experimentally that EBPegasos is more effective and resource-efficient than tree ensembles on high-dimensional data, and comprehensive experiments on synthetic and real-life datasets show that EBPegasos copes with various types of concept drift significantly better than state-of-the-art ensemble frameworks when all ensembles use BPegasos as the base learner.
Zhang, D, Li, H, Jiang, X, Cao, L, Wen, Z, Yang, X & Xue, P 2017, 'Role of AP-2α and MAPK7 in the regulation of autocrine TGF-β/miR-200b signals to maintain epithelial-mesenchymal transition in cholangiocarcinoma', Journal of Hematology and Oncology, vol. 10, no. 1.
Background: Cholangiocarcinoma (CCA) is characterized by early lymphatic metastasis and a low survival rate. Epithelial-mesenchymal transition (EMT) is able to induce tumor metastasis. Although TGF-β/miR-200 signals promote EMT in various types of cancer, the regulatory mechanism in CCA is still unclear. Methods: Expression of miR-200b, TGF-β, and EMT markers was measured in tumor samples and cell lines by qRT-PCR and western blot. The CCK8 assay was performed to measure cell viability. The transwell assay was used to evaluate migration and invasion. The target genes of miR-200b and the transcription factor of TGF-β were analyzed using a dual-luciferase reporter system. Results: We demonstrated that CCA exhibited a remarkable EMT phenotype and that miR-200b was reduced in CCA patients (n = 20) and negatively correlated with TGF-β. Moreover, two CCA cell lines with epithelial appearance, HCCC and RBE, showed fibroblast-like cell morphology with downregulated miR-200b expression when treated with TGF-β. Forced expression of miR-200b abrogated TGF-β-induced EMT initiation, with decreased cell proliferation, migration, and invasion in vitro. Also, TFAP2A (encoding AP-2α) and MAPK7 were found to be targeted by miR-200b to downregulate EMT, and AP-2α inhibited miR-200b by directly promoting transcription of TGFB1. Overexpression of MAPK7 significantly reversed miR-200b-induced inhibition of EMT, migration, and proliferation by increasing the expression of TGF-β, cyclin D1, and Cdk2. Further, the administration of miR-200b induced remarkable tumor regression in vivo and reduced the effect of TGF-β-related EMT in an AP-2α- and MAPK7-dependent manner. Conclusions: Our study highlights that miR-200b-based gene therapy is effective in the treatment of CCA.
Zhou, X, Chen, L, Zhang, Y, Qin, D, Cao, L, Huang, G & Wang, C 2017, 'Enhancing online video recommendation using social user interactions', VLDB Journal, vol. 26, no. 5, pp. 637-656.
The creation of media-sharing communities has resulted in an astonishing increase in digital videos and their wide application in domains like online news broadcasting, entertainment and advertising. The improvement of these applications relies on effective solutions for social users' access to videos, a fact that has driven research interest in recommendation in shared communities. Though effort has been put into social video recommendation, contextual information on social users has not been well exploited for effective recommendation. Motivated by this, in this paper we propose a novel approach based on video content and user information for recommendation in shared communities. A new solution is developed that allows batch video recommendation to multiple new users and optimizes subcommunity extraction. We first propose an effective technique that reduces the subgraph partition cost based on graph decomposition and reconstruction for efficient subcommunity extraction. Then, we design a summarization-based algorithm that groups the clicked videos of multiple unregistered users and simultaneously provides recommendations to each of them. Finally, we present a nontrivial social-update maintenance approach for social data based on user connection summarization. We evaluate the performance of our solution over a large dataset, considering different strategies for group video recommendation in sharing communities.
Hu, L, Cao, L, Cao, J, Gu, Z, Xu, G & Wang, J 2017, 'Improving the Quality of Recommendations for Users and Items in the Tail of Distribution', ACM Transactions on Information Systems, vol. 35, no. 3, pp. 1-25.
Fan, X, Xu, RYD, Cao, L & Song, Y 2017, 'Learning Nonparametric Relational Models by Conjugately Incorporating Node Information in a Network', IEEE Transactions on Cybernetics, vol. 47, no. 3, pp. 589-599.
Relational model learning is useful for numerous practical applications. Many algorithms have been proposed in recent years to tackle this important yet challenging problem. Existing algorithms utilize only binary directional link data to recover hidden network structures. However, there exists far richer and more meaningful information in other parts of a network which one can (and should) exploit. The attributes associated with each node, for instance, contain crucial information to help practitioners understand the underlying relationships in a network. For this reason, in this paper, we propose two models and their solutions, namely the node-information involved mixed-membership model and the node-information involved latent-feature model, in an effort to systematically incorporate additional node information. To effectively achieve this aim, node information is used to generate individual sticks of a stick-breaking process. In this way, not only can we avoid the need to prespecify the number of communities beforehand, the algorithm also encourages that nodes exhibiting similar information have a higher chance of assigning the same community membership. Substantial efforts have been made toward achieving the appropriateness and efficiency of these models, including the use of conjugate priors. We evaluate our framework and its inference algorithms using real-world data sets, which show the generality and effectiveness of our models in capturing implicit network structures.
Cao, L 2016, 'Data Science: Nature and Pitfalls', IEEE INTELLIGENT SYSTEMS, vol. 31, no. 5, pp. 66-75.
As an important tool for behavior informatics, negative sequential patterns (NSP) (such as missing medical treatments) are critical and sometimes much more informative than positive sequential patterns (PSP) (e.g. using a medical service) in many intelligent systems and applications such as intelligent transport systems, healthcare and risk management, as they often involve non-occurring but interesting behaviors. However, discovering NSP is much more difficult than identifying PSP due to the significant problem complexity caused by non-occurring elements, high computational cost and huge search space in calculating negative sequential candidates (NSC). So far, the problem has not been formalized well, and very few approaches have been proposed to mine for specific types of NSP, which rely on database re-scans after identifying PSP in order to calculate the NSC supports. This has been shown to be very inefficient or even impractical, since the NSC search space is usually huge. This paper proposes a very innovative and efficient theoretical framework: Set theory-based NSP mining (ST-NSP), and a corresponding algorithm, e-NSP, to efficiently identify NSP by involving only the identified PSP, without re-scanning the database. Accordingly, negative containment is first defined to determine whether a data sequence contains a negative sequence based on set theory. Second, an efficient approach is proposed to convert the negative containment problem to a positive containment problem. The NSC supports are then calculated based only on the corresponding PSP. This not only avoids the need for additional database scans, but also enables the use of existing PSP mining algorithms to mine for NSP. Finally, a simple but efficient strategy is proposed to generate NSC. Theoretical analyses show that e-NSP performs particularly well on datasets with a small number of elements in a sequence, a large number of itemsets and low minimum s...
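The idea of deriving a negative candidate's support from positive supports alone can be shown in a heavily simplified toy sketch (ours, not the paper's actual e-NSP formulas, and using plain subsequence containment rather than e-NSP's element-wise negative containment): for a single negative element, sup(<a, not-b>) = sup(<a>) - sup(<a, b>), since every sequence containing <a> either does or does not go on to contain <a, b>.

```python
# Toy illustration: compute a negative candidate's support from positive
# supports only, then cross-check with a direct database scan. This is a
# simplified single-negative-element case, not the general e-NSP method.

database = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b"],
    ["a", "d"],
]

def contains(seq, pattern):
    """True if `pattern` occurs as a (not necessarily contiguous)
    subsequence of `seq`."""
    it = iter(seq)
    return all(item in it for item in pattern)

def sup(pattern):
    """Support = number of data sequences containing the pattern."""
    return sum(contains(s, pattern) for s in database)

# Derived support of <a, not-b>, using only positive supports:
derived = sup(["a"]) - sup(["a", "b"])

# Direct scan for verification:
direct = sum(contains(s, ["a"]) and not contains(s, ["a", "b"])
             for s in database)
print(derived, direct)  # -> 2 2
```

In e-NSP the positive supports on the right-hand side come from an ordinary PSP mining pass, so the subtraction replaces any further database scanning.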
Cui, P, Liu, H, Aggarwal, C & Wang, F 2016, 'Uncovering and Predicting Human Behaviors', IEEE Intelligent Systems, vol. 31, no. 2, pp. 77-78.
In data publishing, anonymization techniques have been designed to provide privacy protection. Anatomy is an important technique for privacy preservation in data publication and has attracted considerable attention in the literature. However, anatomy is fragile under the background knowledge attack and the presence attack. In addition, anatomy can only be applied to limited applications. To overcome these drawbacks, we propose an improved version of anatomy: permutation anonymization, a new anonymization technique that is more effective than anatomy in privacy protection and, at the same time, retains significantly more information in the microdata. We present the details of the technique and build its underlying theory. Extensive experiments on real data show that our technique allows highly effective data analysis while offering strong privacy guarantees.
© 2015 Springer Science+Business Media New York Preferred navigation patterns (PNP) are contiguous sequential patterns whose elements users prefer to select as the next step among several alternatives and on which users prefer to spend more time. Such path- and time-preferred navigation patterns are more actionable than findings that consider only path or only time in various web applications, such as web user navigation, targeted online advertising and recommendation. However, due to the conceptual confusion about and limitations on navigation preference in existing work, the corresponding algorithms cannot discover actionable preferred navigation patterns. In this paper, we study the problem of preferred navigation pattern mining by involving both navigation path and time length. Firstly, we carefully define the concepts of time preference and selection preference for time-related path sequences, which reflect user interests in terms of relative path selection and time consumption respectively. Secondly, we propose an efficient PNP-forest algorithm for identifying PNPs, by first introducing the PNP-forest data structure and then presenting its growth and maintenance mechanisms, together with optimization strategies. We then introduce a more efficient mining algorithm called PrefixSpan_Forest, which integrates the advantages of PrefixSpan and PNP-forest. The performance of these two algorithms is evaluated and the results show that they can discover PNPs effectively.
Zheng, Z, Wei, W, Liu, C, Cao, W, Cao, L & Bhatia, M 2016, 'An effective contrast sequential pattern mining approach to taxpayer behavior analysis', World Wide Web, vol. 19, no. 4, pp. 633-651.
Data mining for client behavior analysis has become increasingly important in business; however, further analysis of transactions and sequential behaviors would be of even greater value, especially in the financial services industry (e.g., banking and insurance) and in government. In a real-world business application of taxation debt collection, in order to understand the internal relationship between taxpayers' sequential behaviors (payment, lodgment and actions) and compliance with their debt, we need to find the contrast sequential behavior patterns between compliant and non-compliant taxpayers. Contrast Patterns (CP) are defined as the itemsets showing the difference/discrimination between two classes/datasets (Dong and Li, 1999). However, the existing CP mining methods, which can only mine itemset patterns, are not suitable for mining sequential patterns, such as time-ordered transactions in taxpayer sequential behaviors. Little work has been conducted on Contrast Sequential Pattern (CSP) mining so far. To address this issue, we develop a CSP mining approach, eCSP, using an effective CSP-tree structure, which improves the PrefixSpan tree (Pei et al., 2001) for mining contrast patterns. We propose some heuristics and interestingness filtering criteria, and integrate them into the CSP-tree seamlessly to reduce the search space and to find business-interesting patterns. The performance of the proposed approach is evaluated on three real-world datasets. In addition, we use a case study to show how to apply the approach to analyse taxpayer behaviour. The results show very promising performance and convincing business value.
Hu, L, Cao, L, Cao, J, Gu, Z, Xu, G & Yang, D 2016, 'Learning Informative Priors from Heterogeneous Domains to Improve Recommendation in Cold-Start User Domains', ACM TRANSACTIONS ON INFORMATION SYSTEMS, vol. 35, no. 2.
Many existing recommendation methods such as matrix factorization (MF) mainly rely on the user–item rating matrix, which sometimes is not informative enough and often suffers from the cold-start problem. To address this challenge, complementary textual relations between items are incorporated into recommender systems (RS) in this paper. Specifically, we first apply a novel weighted textual matrix factorization (WTMF) approach to compute the semantic similarities between items, then integrate the inferred item semantic relations into MF and propose a two-level matrix factorization (TLMF) model for RS. Experimental results on two open data sets not only demonstrate the superiority of the TLMF model over benchmark methods, but also show the effectiveness of TLMF in solving the cold-start problem.
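The two-level idea can be sketched as standard matrix factorization whose item factors are additionally pulled toward the factors of semantically similar items. This is a hypothetical minimal sketch: the loss form, the similarity-regularization term and all names here are assumptions, not the authors' exact TLMF formulation.

```python
import numpy as np

def tlmf(R, S, k=2, lam=0.1, beta=0.1, lr=0.01, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    mask = R > 0  # fit observed ratings only
    for _ in range(iters):
        E = mask * (R - U @ V.T)          # rating error on observed cells
        U += lr * (E @ V - lam * U)       # gradient step on user factors
        # item factors: rating error plus pull toward similar items' factors
        V += lr * (E.T @ U - lam * V - beta * (V - S @ V))
    return U, V

R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 0.0],
              [0.0, 2.0, 5.0]])
S = np.array([[0.0, 0.9, 0.1],   # row-normalized item-item similarities
              [0.9, 0.0, 0.1],
              [0.5, 0.5, 0.0]])
U, V = tlmf(R, S)
pred = U @ V.T                   # dense predictions, including unobserved cells
```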
Cao, L, Xie, B, Yang, X, Liang, H, Jiang, X, Zhang, D, Xue, P, Chen, D & Shao, Z 2015, 'MIR-324-5p suppresses hepatocellular carcinoma cell invasion by counteracting ECM degradation through post-transcriptionally downregulating ETS1 and SP1', PLoS ONE, vol. 10, no. 7.
© 2015 Cao et al. Hepatocellular carcinoma (HCC) is a common malignancy; it is highly metastatic and the third most common cause of cancer deaths in the world. The invasion and metastasis of cancer cells is a multistep and complex process that is mainly initiated by extracellular matrix (ECM) degradation. Aberrant expression of microRNAs has been investigated in HCC and shown to play essential roles during HCC progression. In the present study, we found that microRNA-324-5p (miR-324-5p) was downregulated in both HCC cell lines and tissues. Ectopic miR-324-5p expression reduced the invasive and metastatic capacity of HCC cells, whereas inhibition of miR-324-5p promoted the invasion of HCC cells. Matrix metalloproteinase 2 (MMP2) and MMP9, the major regulators of ECM degradation, were found to be downregulated by ectopic miR-324-5p and upregulated by a miR-324-5p inhibitor. E26 transformation-specific 1 (ETS1) and Specificity protein 1 (SP1), both of which modulate MMP2 and MMP9 expression and activity, were identified as direct targets of miR-324-5p and were downregulated by it. Downregulation of ETS1 and SP1 mediated the inhibitory function of miR-324-5p on HCC migration and invasion. Our study demonstrates that miR-324-5p suppresses hepatocellular carcinoma cell invasion and might provide new clues to invasive HCC therapy.
Deng, Z, Cao, L, Jiang, Y & Wang, S 2015, 'Minimax Probability TSK Fuzzy System Classifier: A More Transparent and Highly Interpretable Classification Model', IEEE TRANSACTIONS ON FUZZY SYSTEMS, vol. 23, no. 4, pp. 813-826.
© 2015 Elsevier Ltd. All rights reserved. The Robust Graph mode seeking by Graph Shift (Liu and Yan, 2010) (RGGS) algorithm represents a recent promising approach for discovering dense subgraphs in noisy data. However, there are no theoretical foundations proving the convergence of the RGGS algorithm, leaving open the question of whether the algorithm works for solid reasons. In this paper, we propose a generic theoretical framework consisting of three key Graph Shift (GS) components: the simplex of a generated sequence set, a monotonic and continuous objective function, and a closed mapping. We prove that GS-type algorithms built on such components can be transformed to fit Zangwill's theory, and that the sequence set generated by the GS procedures always terminates at a local maximum, or at worst contains a subsequence which converges to a local maximum of the similarity measure function. The framework is verified by theoretical analysis and experimental results on several typical GS-type algorithms.
Fariha, A, Ahmed, CF, Leung, CK, Samiullah, M, Pervin, S & Cao, L 2015, 'A new framework for mining frequent interaction patterns from meeting databases', ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, vol. 45, pp. 103-118.
Fournier-Viger, P, Wu, C-W, Tseng, VS, Cao, L & Nkambou, R 2015, 'Mining Partially-Ordered Sequential Rules Common to Multiple Sequences', IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, vol. 27, no. 8, pp. 2203-2216.
Jiang, Y, Tsai, P, Hao, Z & Cao, L 2015, 'Automatic multilevel thresholding for image segmentation using stratified sampling and Tabu Search', Soft Computing, vol. 19, no. 9, pp. 2605-2617.
Image segmentation techniques have been widely applied in many fields such as pattern recognition and feature extraction. For the primate visual attention model, the perceptual organization is an important process to automatically extract the desirable features. In this article, we propose a new method called an automatic multilevel thresholding algorithm using the stratified sampling and Tabu Search (AMTSSTS) by imitating the primate visual perceptual behaviors. In the AMTSSTS algorithm, a gray image is treated as a population with the gray values of pixels as the individuals. First, the image is evenly divided into several strata (blocks), and a sample is drawn from each stratum. Second, a Tabu Search-based optimization is applied to each sample to maximize the ratio between mean and variance for each sample. The threshold number and threshold values are preliminarily determined based on the optimized samples, and are further optimized by a deterministic method which includes a new local criterion function with property of local continuity of an image. Results of extensive simulations on Berkeley datasets indicate that AMTSSTS can obtain more effective, efficient and smooth segmentation, and can be applied to complex and real-time environments. © 2014 Springer-Verlag Berlin Heidelberg.
Liu, W, Deng, Z-H, Cao, L, Xu, X, Liu, H & Gong, X 2015, 'Mining Top K Spread Sources for a Specific Topic and a Given Node', IEEE TRANSACTIONS ON CYBERNETICS, vol. 45, no. 11, pp. 2472-2483.
Wang, C, Cao, L & Chi, C-H 2015, 'Formalization and Verification of Group Behavior Interactions', IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, vol. 45, no. 8, pp. 1109-1124.
Wang, C, Dong, X, Zhou, F, Cao, L & Chi, C-H 2015, 'Coupled Attribute Similarity Learning on Categorical Data', IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, vol. 26, no. 4, pp. 781-797.
Yang, W, Gao, Y, Shi, Y & Cao, L 2015, 'MRM-Lasso: A Sparse Multiview Feature Selection Method via Low-Rank Analysis', IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, vol. 26, no. 11, pp. 2801-2815.
Yue, XD, Cao, LB, Miao, DQ, Chen, YF & Xu, B 2015, 'Multi-view attribute reduction model for traffic bottleneck analysis', KNOWLEDGE-BASED SYSTEMS, vol. 86, pp. 1-10.
Fan, X, Cao, L & Xu, RYD 2015, 'Dynamic Infinite Mixed-Membership Stochastic Blockmodel', IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, vol. 26, no. 9, pp. 2072-2085.
Cao, L & Joachims, T 2014, 'Behavior Computing', IEEE Intelligent Systems, vol. 29, no. 4, pp. 62-66.
Deng, Z, Choi, K-S, Cao, L & Wang, S 2014, 'T2FELA: Type-2 Fuzzy Extreme Learning Algorithm for Fast Training of Interval Type-2 TSK Fuzzy Logic System', IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, vol. 25, no. 4, pp. 664-676.
Liu, B, Xiao, Y, Yu, PS, Cao, L, Zhang, Y & Hao, Z 2014, 'Uncertain One-Class Learning and Concept Summarization Learning on Uncertain Data Streams', IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 2, pp. 468-484.
This paper presents a novel framework for uncertain one-class learning and concept summarization learning on uncertain data streams. Our proposed framework consists of two parts. First, we put forward uncertain one-class learning to cope with data uncertainty. We propose a local kernel-density-based method to generate a bound score for each instance, which refines the location of the corresponding instance, and then construct an uncertain one-class classifier (UOCC) by incorporating the generated bound score into a one-class SVM-based learning phase. Second, we propose a support vectors (SVs)-based clustering technique to summarize the concept of the user from the history chunks by representing the chunk data using the support vectors of the uncertain one-class classifier developed on each chunk, and then extend the k-means clustering method to cluster history chunks so that we can summarize the concept from the history chunks. Our proposed framework explicitly addresses the problem of one-class learning and concept summarization learning on uncertain one-class data streams. Extensive experiments on uncertain data streams demonstrate that our proposed uncertain one-class learning method performs better than others, and our concept summarization method can summarize the evolving interests of the user from the history chunks.
Liu, B, Xiao, Y, Yu, PS, Hao, Z & Cao, L 2014, 'An efficient orientation distance–based discriminative feature extraction method for multi-classification', Knowledge and Information Systems, vol. 39, no. 2, pp. 409-433.
Feature extraction is an important step before actual learning. Although many feature extraction methods have been proposed for clustering, classification and regression, very limited work has been done on multi-class classification problems. This paper proposes a novel feature extraction method, called orientation distance–based discriminative (ODD) feature extraction, particularly designed for multi-class classification problems. Our proposed method works in two steps. In the first step, we extend the Fisher discriminant idea to determine an appropriate kernel function and map the input data of all classes into a feature space where the classes are well separated. In the second step, we put forward two variants of ODD features, i.e., one-vs-all-based and one-vs-one-based ODD features. We first construct hyperplanes (SVMs) based on the one-vs-all or one-vs-one scheme in the feature space; we then extract one-vs-all-based or one-vs-one-based ODD features between a sample and each hyperplane. These newly extracted ODD features are treated as the representative features and are thereafter used in the subsequent classification phase. Extensive experiments have been conducted to investigate the performance of one-vs-all-based and one-vs-one-based ODD features for multi-class classification. The statistical results show that the classification accuracy based on ODD features outperforms that of state-of-the-art feature extraction methods.
Liu, B, Xiao, YS, Yu, PS, Hao, ZF & Cao, LB 2014, 'An Efficient Approach for Outlier Detection with Imperfect Data Labels', IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp. 1602-1616.
The task of outlier detection is to identify data objects that are markedly different from or inconsistent with the normal set of data. Most existing solutions typically build a model using the normal data and identify outliers that do not fit the represented model well. However, in addition to normal data, limited negative examples or outliers also exist in many applications, and data may be corrupted such that the outlier detection data is imperfectly labeled. This makes outlier detection far more difficult than in the traditional setting. This paper presents a novel outlier detection approach to address data with imperfect labels and incorporate limited abnormal examples into learning. To deal with data with imperfect labels, we introduce likelihood values for each input datum which denote its degree of membership toward the normal and abnormal classes respectively. Our proposed approach works in two steps. In the first step, we generate a pseudo training dataset by computing likelihood values for each example based on its local behavior. We present a kernel k-means clustering method and a kernel LOF-based method to compute the likelihood values. In the second step, we incorporate the generated likelihood values and limited abnormal examples into an SVDD-based learning framework to build a more accurate classifier for global outlier detection. By integrating local and global outlier detection, our proposed method explicitly handles data with imperfect labels and enhances the performance of outlier detection. Extensive experiments on real-life datasets have demonstrated that our proposed approaches achieve a better tradeoff between detection rate and false alarm rate than state-of-the-art outlier detection approaches.
Liu, H-D, Yang, M, Gao, Y & Cao, L 2014, 'Fast Local Histogram Specification', IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 11, pp. 1833-1843.
Local histogram specification (LHS) is a useful technique for image processing. However, LHS faces a critical computational challenge when it is applied to high-resolution high-precision images. The calculation of the values in the cumulative distribution function (CDF) and the mapped value for the central pixel in each sliding window is time consuming, with the computational complexity O(s + L) of the state-of-the-art techniques, where s is the side length of the square window and L is the number of gray levels. In this paper, we propose a fast algorithm for LHS, called fast local histogram specification (FLHS). FLHS reduces the complexity of calculating the CDF value for the central pixel in each sliding window to O(s + √L), and the time complexity of the mapping procedure in each window to O(log L). This reduces the overall time complexity of LHS from O(s + L) to O(s + √L) per sliding window. Theoretical analysis shows that the newly developed algorithm is efficient. Experimental results on 8-bit and high-resolution high-precision (16-bit) images demonstrate the efficiency of our proposed algorithm.
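As background, the classical CDF mapping that LHS applies per sliding window can be sketched globally as follows: map each gray level through the source CDF, then invert the target CDF. This is the plain building block, not the paper's windowed speedup.

```python
import numpy as np

def specify_histogram(img, target_hist, levels=256):
    """Classical histogram specification via CDF matching."""
    src_hist = np.bincount(img.ravel(), minlength=levels)
    src_cdf = np.cumsum(src_hist) / img.size
    tgt_cdf = np.cumsum(target_hist) / np.sum(target_hist)
    # for each level g, find the smallest level whose target CDF
    # reaches the source CDF at g
    mapping = np.searchsorted(tgt_cdf, src_cdf, side="left")
    return mapping.clip(0, levels - 1)[img].astype(np.uint8)

img = np.array([[0, 0, 1], [1, 2, 3]], dtype=np.uint8)
flat_target = np.ones(4)          # aim for a uniform 4-level histogram
out = specify_histogram(img, flat_target, levels=4)
# out == [[1, 1, 2], [2, 3, 3]]
```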
Tu, E, Cao, L, Yang, J & Kasabov, N 2014, 'A novel graph-based k-means for nonlinear manifold clustering and representative selection', Neurocomputing, vol. 143, pp. 109-122.
Wang, C, Tong, T, Cao, L & Miao, B 2014, 'Non-parametric shrinkage mean estimation for quadratic loss functions with unknown covariance matrices', Journal of Multivariate Analysis, vol. 125, pp. 222-232.
In this paper, a shrinkage estimator for the population mean is proposed under known quadratic loss functions with unknown covariance matrices. The new estimator is non-parametric in the sense that it does not assume a specific parametric distribution for the data and it does not require the prior information on the population covariance matrix. Analytical results on the improvement of the proposed shrinkage estimator are provided and some corresponding asymptotic properties are also derived. Finally, we demonstrate the practical improvement of the proposed method over existing methods through extensive simulation studies and real data analysis.
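The shrinkage mechanic the paper builds on can be illustrated with a classical James-Stein-style positive-part estimator that pulls the sample mean toward the grand mean. This sketch is illustrative only; the paper's estimator is non-parametric and handles unknown covariance matrices, which this toy version does not.

```python
import numpy as np

def shrink_mean(X, target=None):
    """Positive-part shrinkage of the sample mean toward a target."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    if target is None:
        target = np.full(p, xbar.mean())       # shrink toward grand mean
    s2 = X.var(axis=0, ddof=1).mean() / n      # avg variance of the mean
    dist2 = np.sum((xbar - target) ** 2)
    w = max(0.0, 1.0 - (p - 2) * s2 / dist2) if dist2 > 0 else 0.0
    return target + w * (xbar - target)        # w in [0, 1]

rng = np.random.default_rng(1)
X = rng.normal(loc=np.arange(5, dtype=float), scale=1.0, size=(20, 5))
est = shrink_mean(X)
```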
Xiao, Y, Liu, B, Hao, Z & Cao, L 2014, 'A K-Farthest-Neighbor-based approach for support vector data description', Applied Intelligence, vol. 41, no. 1, pp. 196-211.
Xiao, YS, Liu, B, Hao, ZF & Cao, LB 2014, 'A Similarity-Based Classification Framework for Multiple-Instance Learning', IEEE Transactions on Cybernetics, vol. 44, no. 4, pp. 500-515.
Multiple-instance learning (MIL) is a generalization of supervised learning that attempts to learn useful information from bags of instances. In MIL, the true labels of instances in positive bags are not available for training. This leads to a critical challenge, namely, handling the instances of which the labels are ambiguous (ambiguous instances). To deal with these ambiguous instances, we propose a novel MIL approach, called similarity-based multiple-instance learning (SMILE). Instead of eliminating a number of ambiguous instances in positive bags from training the classifier, as done in some previous MIL works, SMILE explicitly deals with the ambiguous instances by considering their similarity to the positive class and the negative class. Specifically, a subset of instances is selected from positive bags as the positive candidates and the remaining ambiguous instances are associated with two similarity weights, representing the similarity to the positive class and the negative class, respectively. The ambiguous instances, together with their similarity weights, are thereafter incorporated into the learning phase to build an extended SVM-based predictive classifier. A heuristic framework is employed to update the positive candidates and the similarity weights for refining the classification boundary. Experiments on real-world datasets show that SMILE demonstrates highly competitive classification accuracy and shows less sensitivity to labeling noise than the existing MIL methods.
An image in social media, termed a social image, exhibits characteristics different from images widely discussed in image processing. They can be described by both content and social related attributes, called social image attributes, including visual contents, users, tags, and timestamps. There are strong coupling relationships between social image attributes, which make social images not independent and identically distributed (non-IID). By analyzing the relationships among these attributes, we can better understand the semantic activities conducted on such non-IID social images, hence enabling new applications including content organization, recommendation, and social activity understanding. In this article, we present a novel algorithm to analyze the coupling relationships between social images, which involves not only intra-coupled similarity within a social image attribute, but also inter-coupled similarity between attributes, in analyzing the non-IIDness of the similarity between social images. In particular, we propose a multi-entry version of the coupled similarity metric to deal with attributes (i.e., tags) which have a many-to-one relationship with respect to images. Experimental results on a Flickr group dataset show that the proposed algorithm captures coupling relationships and therefore achieves promising results in various applications, including image clustering and tagging.
Yang, W, Gao, Y, Cao, L, Yang, M & Shi, Y 2014, 'mPadal: a joint local-and-global multi-view feature selection method for activity recognition', APPLIED INTELLIGENCE, vol. 41, no. 3, pp. 776-790.
Yue, XD, Miao, DQ, Cao, LB, Wu, Q & Chen, YF 2014, 'An efficient color quantization based on generic roughness measure', Pattern Recognition, vol. 47, no. 4, pp. 1777-1789.
Wang, C, Cao, L, Gaussier, E, Li, J, Ou, Y & Luo, D 2014, 'Coupled Behavior Representation, Modeling, Analysis, and Reasoning', IEEE Intelligent Systems, vol. 29, no. 4, pp. 66-69.
Behavior refers to the action, reaction, or property of an entity, human or otherwise, to situations or stimuli in its environment.1 The in-depth analysis of behavior has been increasingly recognized as a crucial means for understanding and disclosing the interior driving forces and intrinsic cause-effects in business and social applications, including Web community analysis, counter-terrorism, fraud detection, and customer relationship management. With the deepening and widening of social/business intelligence and its networking, the concept of behavior is in great demand to be consolidated and formalized to deeply scrutinize native behavior intention, lifecycle, and impact on complex problems and business issues.

Although there is an emerging focus on deep behavior studies, such as social network analysis,2 periodic behavior analysis,3 and the behavior informatics approach,1 previous research has mainly focused on individual behaviors without considering the interactions between them. However, with increasing network- and community-based events and applications, such as group-based crime and social network interactions, coupling relationships between behaviors contribute to the intrinsic causes and impacts of eventual business and social problems. In real-world applications, group behavior interactions (that is, coupled behaviors) are widely seen in natural, social, and artificial behavior-related problems.

Complex behavior and social applications often exhibit strong explicit or implicit coupling relationships both between their entities and their properties. Moreover, it is also quite difficult to model, analyze, and check behaviors coupled with one another due to the complexity from the data, domain, context, and impact perspectives.

Due to the emerging popularity and importance of coupled behaviors, the representation, modeling, analysis, mining and learning, and determination of coupled behaviors are becoming increasingly essential yet challenging
One key challenge in multi-label learning is how to exploit label dependency effectively, and existing methods mainly address this issue by training a prediction model for each label based on the combination of the original features and the labels on which it depends. However, in this way the influence of label dependency may be weakened by the significant imbalance in dimensionality between the feature set and the dependent label set, and the dynamic interaction between labels cannot be exploited effectively. In this paper, we propose a new framework to exploit the dependencies between labels iteratively and interactively. Each label's prediction is updated through an iterative propagation process, rather than being determined directly by a prediction model. Specifically, we utilize a graph model to encode the dependencies between labels, and employ the random-walk with restart (RWR) strategy to propagate the dependency among all labels iteratively until the predictions for all labels converge. We validate our approach by experiments, and the results demonstrate that it yields significant improvements compared with several state-of-the-art algorithms.
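A minimal random-walk-with-restart iteration of the kind described above can be sketched as follows: scores diffuse over a label graph W and are pulled back toward the initial predictions p0 with restart probability c. This is a sketch of the generic RWR step only; the label-graph construction and the per-label models are the paper's contribution.

```python
import numpy as np

def rwr(W, p0, c=0.3, tol=1e-9, max_iter=1000):
    """Random walk with restart: p <- (1-c) P p + c p0 until convergence."""
    P = W / W.sum(axis=0, keepdims=True)   # column-normalize transitions
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - c) * P @ p + c * p0
        if np.abs(p_next - p).max() < tol:
            return p_next
        p = p_next
    return p

W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])            # chain of three labels
p0 = np.array([1.0, 0.0, 0.0])             # initially only label 0 active
p = rwr(W, p0)                             # labels nearer the seed score higher
```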
Cao, L 2013, 'Combined mining: Analyzing object and pattern relations for discovering and constructing complex yet actionable patterns', Wiley Interdisciplinary Reviews-Data Mining And Knowledge Discovery, vol. 3, no. 2, pp. 140-155.
Combined mining is a technique for analyzing object relations and pattern relations, and for extracting and constructing actionable knowledge (patterns or exceptions). Although combined patterns can be built within a single method, such as combined seque
Jiang, F, Dong, D, Cao, L & Frater, MR 2013, 'Agent-Based Self-Adaptable Context-Aware Network Vulnerability Assessment', IEEE Transactions on Network and Service Management, vol. 10, no. 3, pp. 255-270.
Immunology-inspired computer security has attracted enormous attention due to its potential impact on the next generation of service-oriented network operating systems. In this paper, we propose a new agent-based threat awareness assessment strategy, inspired by the human immune system, to dynamically adapt against attacks. Specifically, this approach is based on the dynamic reconfiguration of file access rights for system calls or logs (e.g., file rewritability) with balanced adaptability and vulnerability. Based on an information-theoretic analysis of the coherent associations among adaptability, autonomy and vulnerability, a generic solution is suggested to break down their coherent links. The principle is to maximize a context- and situation-aware system's adaptability while simultaneously reducing its vulnerability. Experimental results show the efficiency of the proposed biologically inspired vulnerability awareness system.
Outlier detection is an important problem that has been studied within diverse research areas and application domains. Most existing methods are based on the assumption that an example can be exactly categorized as either normal or an outlier. However, in many real-life applications, data are uncertain in nature due to various errors or partial completeness. This uncertainty makes the detection of outliers far more difficult than it is on clearly separable data. The key challenge of handling uncertain data in outlier detection is how to reduce the impact of uncertain data on the learned distinctive classifier. This paper proposes a new SVDD-based approach to detect outliers on uncertain data. The proposed approach operates in two steps. In the first step, a pseudo-training set is generated by assigning a confidence score to each input example, which indicates the likelihood of the example tending toward the normal class. In the second step, the generated confidence scores are incorporated into the support vector data description training phase to construct a global distinctive classifier for outlier detection. In this phase, the contribution of the examples with the least confidence scores to the construction of the decision boundary is reduced. The experiments show that the proposed approach outperforms state-of-the-art outlier detection techniques.
Wang, C, Cao, L & Miao, B 2013, 'Optimal feature selection for sparse linear discriminant analysis and its applications in gene expression data', Computational Statistics and Data Analysis, vol. 66, pp. 140-149.
This work studies the theoretical rules of feature selection in linear discriminant analysis (LDA), and a new feature selection method is proposed for sparse linear discriminant analysis. An l1 minimization method is used to select the important features from which the LDA will be constructed. The asymptotic results of this proposed two-stage LDA (TLDA) are studied, demonstrating that TLDA is an optimal classification rule whose convergence rate is the best compared to existing methods. The experiments on simulated and real datasets are consistent with the theoretical results and show that TLDA performs favorably in comparison with current methods. Overall, TLDA uses a lower minimum number of features or genes than other approaches to achieve a better result with a reduced misclassification rate.
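The two-stage idea can be sketched in a simplified form: stage 1 keeps features whose marginal regression coefficient survives a soft threshold (an l1-style selection, equivalent to the lasso only under an orthogonal design), and stage 2 fits ordinary Fisher LDA on the selected features. This is a hedged illustrative sketch, not the authors' exact TLDA procedure.

```python
import numpy as np

def select_features(X, y, thresh=0.1):
    # marginal covariance of each feature with the label; hard-threshold
    b = (X - X.mean(0)).T @ (y - y.mean()) / len(y)
    return np.flatnonzero(np.abs(b) > thresh)

def fisher_lda(X, y):
    # two-class Fisher LDA with a midpoint cutoff
    m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
    S = np.cov(X[y == 0], rowvar=False) + np.cov(X[y == 1], rowvar=False)
    w = np.linalg.solve(S, m1 - m0)            # discriminant direction
    c = w @ (m0 + m1) / 2
    return lambda Z: (Z @ w > c).astype(int)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = (X[:, 0] + X[:, 1] + X[:, 2] > 0).astype(int)  # 3 informative features

sel = select_features(X, y)                    # stage 1: sparse selection
clf = fisher_lda(X[:, sel], y)                 # stage 2: LDA on selection
acc = (clf(X[:, sel]) == y).mean()
```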
In this work, we redefine two important statistics, the CLRT test [Z. Bai, D. Jiang, J. Yao, S. Zheng, Corrections to LRT on large-dimensional covariance matrix by RMT, The Annals of Statistics 37 (6B) (2009) 3822–3840] and the LW test [O. Ledoit, M. Wolf, Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size, The Annals of Statistics (2002) 1081–1102], for identity tests on high-dimensional data using random matrix theory. Compared with the existing CLRT and LW tests, the new tests can accommodate data with unknown means and non-Gaussian distributions. Simulations demonstrate that the new tests have good properties in terms of size and power. Moreover, even for Gaussian data, our new tests perform favorably in comparison to existing tests. Finally, we find the CLRT is more sensitive to eigenvalues less than 1, while the LW test has more advantages in detecting eigenvalues larger than 1.
Yang, W, Gao, Y & Cao, L 2013, 'TRASMIL: A local anomaly detection framework based on trajectory segmentation and multi-instance learning', Computer Vision And Image Understanding, vol. 117, no. 10, pp. 1273-1286.
Local anomaly detection refers to detecting small anomalies or outliers that exist in some subsegments of events or behaviors. Such local anomalies are easily overlooked by most of the existing approaches since they are designed for detecting global or l
Yu, D, Nanda, P, Cao, L & He, S 2013, 'TCTM: an evaluation framework for architecture design on wireless sensor networks', International Journal of Sensor Networks, vol. 14, no. 3, pp. 168-177.
This paper presents an evaluation framework for architecture designs on wireless sensor networks (WSNs). We introduce a simple evaluation model, the triangular constraint tradeoffs model (TCTM), to capture the essence of architecture design considerations under the transient characteristics of wireless media and the stringent limitations on the energy and computing resources of WSNs. Based on this evaluation framework, we investigate the existing architectures proposed in the literature from three main competing constraint aspects, namely generality, cost, and performance. Two important concepts, performance efficiency and deployment efficiency, are identified and distinguished within overall architecture efficiency. With this abstract yet powerful model, we describe the motivations of the major body of WSN architectures proposed in the current literature. We also analyse the fundamental advantages and limitations of each class of architectures from the TCTM perspective, and foresee the influence of evolving technology on future architecture design. We believe our efforts will serve as a reference to orient researchers and system designers in this area.
Zhang, D, Cao, L, Li, Y, Lu, H, Yang, X & Xue, P 2013, 'Expression of glioma-associated oncogene 2 (Gli 2) is correlated with poor prognosis in patients with hepatocellular carcinoma undergoing hepatectomy', World Journal of Surgical Oncology, vol. 11.
Background: Our previous studies showed that glioma-associated oncogene (Gli)2 plays an important role in the proliferation and apoptosis resistance of hepatocellular carcinoma (HCC) cells. The aim of this study was to explore the clinical significance of Gli2 expression in HCC. Methods: Expression of Gli2 protein was detected in samples from 68 paired HCC samples, the corresponding paraneoplastic liver tissues, and 20 normal liver tissues using immunohistochemistry. Correlation of the immunohistochemistry results with clinicopathologic parameters, prognosis, and the expression of E-cadherin, N-cadherin, and vimentin were analyzed. Results: Immunohistochemical staining showed high levels of Gli2 protein expression in HCC, compared with paraneoplastic and normal liver tissues (P < 0.05). This high expression level of Gli2 was significantly associated with tumor differentiation, encapsulation, vascular invasion, early recurrence, and intra-hepatic metastasis (P < 0.05). There was a significantly negative correlation between Gli2 and E-cadherin expression (r = -0.302, P < 0.05) and a significantly positive correlation between expression of Gli2 and expression of vimentin (r = -0.468, P < 0.05) and N-cadherin (r = -0.505, P < 0.05). Kaplan-Meier analysis showed that patients with overexpressed Gli2 had significantly shorter overall survival and disease-free survival times (P < 0.05). Multivariate analysis suggested that the level of Gli2 expression was an independent prognostic factor for HCC. Conclusions: Expression of Gli2 is high in HCC tissue, and is associated with poor prognosis in patients with HCC after hepatectomy. © 2013 Zhang et al.; licensee BioMed Central Ltd.
Zhou, J, Cao, L & Yang, N 2013, 'On the convergence of some possibilistic clustering algorithms', Fuzzy Optimization and Decision Making, vol. 12, no. 4, pp. 415-432.
In this paper, an analysis of the convergence performance is conducted for a class of possibilistic clustering algorithms (PCAs) utilizing the Zangwill convergence theorem. It is shown that under certain conditions the iterative sequence generated by a P
Wei, W, Li, J, Cao, L, Ou, Y & Chen, J 2013, 'Effective Detection of Sophisticated Online Banking Fraud in Extremely Imbalanced Data', World Wide Web, vol. 16, no. 4, pp. 449-475.
Sophisticated online banking fraud reflects the integrative abuse of resources in the social, cyber and physical worlds. Its detection is a typical use case of the broad-based Wisdom Web of Things (W2T) methodology. However, there is very limited information available to distinguish dynamic fraud from genuine customer behavior in such an extremely sparse and imbalanced data environment, which makes instant and effective detection increasingly important and challenging. In this paper, we propose an effective online banking fraud detection framework that synthesizes relevant resources and incorporates several advanced data mining techniques. By building a contrast vector for each transaction based on its customer's historical behavior sequence, we profile the differentiating rate of each current transaction against the customer's behavior preference. A novel algorithm, ContrastMiner, is introduced to efficiently mine contrast patterns and distinguish fraudulent from genuine behavior, followed by effective pattern selection and risk scoring that combines predictions from different models. Results from experiments on large-scale real online banking data demonstrate that our system achieves substantially higher accuracy and lower alert volume than the latest benchmark fraud detection system incorporating domain knowledge and traditional fraud detection methods.
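The contrast-vector idea, scoring a transaction against that customer's own history rather than a global profile, can be shown with a toy sketch. The customers, amounts and scoring rule below are invented for illustration; this is not the paper's ContrastMiner algorithm.

```python
# Toy contrast scoring: how far does a transaction deviate from the same
# customer's historical behaviour, in units of that customer's own spread?
import numpy as np

history = {                      # hypothetical past transaction amounts
    "alice": [20.0, 25.0, 22.0, 30.0, 18.0],
    "bob": [500.0, 450.0, 520.0, 480.0],
}

def contrast_score(customer, amount):
    past = np.asarray(history[customer])
    # z-score-like deviation; epsilon guards against zero variance
    return abs(amount - past.mean()) / (past.std() + 1e-9)

# The same $400 purchase is routine for bob but anomalous for alice.
assert contrast_score("alice", 400.0) > contrast_score("bob", 400.0)
print(contrast_score("alice", 400.0), contrast_score("bob", 400.0))
```

The design point is that the threshold is per-customer: an absolute amount carries no signal until contrasted with the individual behavior sequence.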
Actionable knowledge has been qualitatively and intensively studied in the social sciences. Its marriage with data mining is only a recent story. On the one hand, data mining has been booming for a while and has attracted an increasing variety of increas
Cao, L 2012, 'Social Security and Social Welfare Data Mining: An Overview', IEEE Transactions On Systems Man And Cybernetics Part C-Applications And Reviews, vol. 42, no. 6, pp. 837-853.
The importance of the social security and social welfare business has been increasingly recognized in more and more countries. It impinges on a large proportion of the population and affects government service policies and people's quality of life. Typical welfare countries, such as Australia and Canada, have accumulated a huge amount of social security and social welfare data. Emerging business issues such as fraudulent outlays, and customer service and performance improvements, challenge existing policies, as well as techniques and systems including data matching and business intelligence reporting systems. The need for a deep understanding of customers and customer-government interactions through advanced data analytics has been increasingly recognized by the community at large. So far, however, no substantial work on the mining of social security and social welfare data has been reported. For the first time in data mining and machine learning, and to the best of our knowledge, this paper draws a comprehensive overall picture and summarizes the corresponding techniques and illustrations to analyze social security/welfare data, namely, social security data mining (SSDM), based on a thorough review of a large number of related references from the past half century. In particular, we introduce an SSDM framework, including business and research issues, social security/welfare services and data, as well as challenges, goals, and tasks in mining social security/welfare data. A summary of SSDM case studies is also presented with substantial citations that direct readers to more specific techniques and practices about SSDM.
Agent mining is an emerging interdisciplinary area that integrates multiagent systems, data mining and knowledge discovery, machine learning and other relevant areas. It brings new opportunities to tackling issues in relevant fields more efficiently by e
Jiang, X, Xiang, G, Wang, Y, Zhang, L, Yang, X, Cao, L, Peng, H, Xue, P & Chen, D 2012, 'MicroRNA-590-5p regulates proliferation and invasion in human hepatocellular carcinoma cells by targeting TGF-β RII', Molecules and Cells, vol. 33, no. 6, pp. 545-551.
MicroRNAs (miRNAs) are regulatory small non-coding RNAs that can regulate gene expression by binding to gene elements, such as the gene promoter and 5'UTR, but mainly the 3'UTR of mRNA. One miRNA targets many mRNAs, and one mRNA can be regulated by many miRNAs, leading to a complex metabolic network. In our study, we found that the expression level of miR-590-5p is higher in the human hepatocellular carcinoma cell line HepG2 than in the normal hepatocellular cell line L02. Downregulation of miR-590-5p inhibited proliferation and invasion of hepatocellular carcinoma cells (HCCs). We also showed that expression of TGF-beta RII, which has been regarded as a regulator of tumor proliferation, invasion, and migration in hepatocellular carcinoma, is regulated by miRNA-590-5p. In addition, miR-590-5p downregulated the expression of TGF-beta RII by targeting the 3'UTR of its mRNA. We also found that downregulation of miR-590-5p was associated with an elevation of TGF-beta RII and inhibition of proliferation and invasion in HepG2 cells. Furthermore, overexpression of miR-590-5p was associated with upregulation of TGF-beta RII and could promote proliferation and invasion in L02 cells. In conclusion, we determined that TGF-beta RII is a novel target of miRNA-590-5p. Thus, the role of TGF-beta RII in regulating proliferation and invasion of human HCCs is controlled by miR-590-5p. In other words, miR-590-5p promotes proliferation and invasion in human HCCs by directly targeting TGF-beta RII. © 2012 KSMCB.
Li, Z, He, Y, Cao, L, Wong, L & Li, J 2012, 'Conservation of water molecules in protein binding interfaces', International Journal of Bioinformatics Research and Applications, vol. 8, no. 3/4, pp. 228-244.
The conservation of interfacial water molecules has only been studied in small data sets consisting of interfaces of a specific function. So far, no general conclusions have been drawn from large-scale analysis, due to the challenges of using structural alignment in large data sets. To avoid using structural alignment, we propose a solvated sequence method to analyse water conservation properties in protein interfaces. We first use water information to label the residues, and then align interfacial residues in a fashion similar to normal sequence alignment. Our results show that, for a water-contacting interfacial residue, substituting it with hydrophobic residues tends to desolvate the local area. Surprisingly, residues with short side chains also tend not to lose their contacting water, emphasising the role of water in shaping binding sites. Deeply buried water molecules are found to be more conserved in terms of their contacts with interfacial residues.
Melli, G, Wu, X, Beinat, P, Bonchi, F, Cao, L, Duan, R, Faloutsos, C, Ghani, R, Kitts, B, Goethals, B, McLachlan, G, Pei, J, Srivastava, A & Zaiane, O 2012, 'Top-10 Data Mining Case Studies', International Journal of Information Technology and Decision Making, vol. 11, no. 2, pp. 389-400.
We report on the panel discussion held at the ICDM'10 conference on the top 10 data mining case studies in order to provide a snapshot of where and how data mining techniques have made significant real-world impact. The tasks covered by 10 case studies r
Yeh, W, Cao, L & Jin, J 2012, 'A Cellular Automata Hybrid Quasi-random Monte Carlo Simulation for Estimating the One-to-all Reliability of Acyclic Multistate Information Networks', International Journal Of Innovative Computing Information And Control, vol. 8, no. 3(B), pp. 2001-2014.
Many real-world systems (such as cellular telephone and transportation networks) are acyclic multi-state information networks (AMINs). These networks are composed of multi-state nodes, with different states determined by a set of nodes that receive a signal directly from these multi-state nodes, without satisfying the conservation law. Evaluating AMIN reliability arises at the design and exploitation stages of many types of technical systems. However, existing analytical methods fail to estimate AMIN reliability in a realistic time frame, even for smaller-sized AMINs. Hence, the main purpose of this article is to present a cellular automata hybrid quasi-Monte Carlo simulation (CA-HMC), combining cellular automata (CA, to rapidly determine network states), pseudo-random sequences (PRS, to obtain the flexibility of the network) and quasi-random sequences (QRS, to improve the accuracy), to obtain a high-quality estimation of AMIN reliability and improve calculation efficiency. We use one benchmark example from well-known algorithms in the literature to show the utility and performance of the proposed CA-HMC simulation when evaluating one-to-all AMIN reliability.
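The payoff of mixing quasi-random with pseudo-random sampling can be seen on a toy network whose reliability is known in closed form. The 4-link two-path network below is a hypothetical stand-in chosen only because its exact reliability is easy to compute; it is not the benchmark used in the paper, and the sketch omits the cellular-automata component entirely.

```python
# Pseudo- vs quasi-random Monte Carlo for a toy network reliability
# problem: two parallel paths, each a series of two independent links.
import numpy as np
from scipy.stats import qmc

p = 0.9                        # each link works independently with prob p
exact = 1 - (1 - p**2)**2      # closed-form source-to-sink reliability

def network_up(u):
    # u: (n, 4) uniforms; link i works when u[:, i] < p
    up = u < p
    return (up[:, 0] & up[:, 1]) | (up[:, 2] & up[:, 3])

n = 4096                       # power of two, as Sobol sampling prefers
pseudo = network_up(np.random.default_rng(0).random((n, 4))).mean()
quasi = network_up(qmc.Sobol(d=4, seed=0).random(n)).mean()
print(abs(pseudo - exact), abs(quasi - exact))
```

With the same sample budget, the low-discrepancy (Sobol) estimate typically lands closer to the exact value than the pseudo-random one, which is the accuracy argument for the QRS ingredient.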
Color image segmentation is an important technique in image processing systems. Highly precise segmentation with low computational complexity can be achieved through roughness measurement, which approximates the color histogram based on rough set theor
Coupled behaviors refer to the activities of one to many actors who are associated with each other in terms of certain relationships. With increasing network and community-based events and applications, such as group-based crime and social network intera
Cao, L, Zhang, H, Zhao, Y, Luo, D & Zhang, C 2011, 'Combined Mining: Discovering Informative Knowledge in Complex Data', IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 3, pp. 699-712.
Enterprise data mining applications often involve complex data such as multiple large heterogeneous data sources, user preferences, and business impact. In such situations, a single method or one-step mining is often limited in discovering informative knowledge. It would also be very time and space consuming, if not impossible, to join relevant large data sources for mining patterns consisting of multiple aspects of information. It is crucial to develop effective approaches for mining patterns combining necessary information from multiple relevant business lines, catering for real business settings and decision-making actions rather than just providing a single line of patterns. The recent years have seen increasing efforts on mining more informative patterns, e.g., integrating frequent pattern mining with classifications to generate frequent pattern-based classifiers. Rather than presenting a specific algorithm, this paper builds on our existing works and proposes combined mining as a general approach to mining for informative patterns combining components from either multiple data sets or multiple features or by multiple methods on demand. We summarize general frameworks, paradigms, and basic processes for multifeature combined mining, multisource combined mining, and multimethod combined mining. Novel types of combined patterns, such as incremental cluster patterns, can result from such frameworks, which cannot be directly produced by the existing methods. A set of real-world case studies has been conducted to test the frameworks, with some of them briefed in this paper. They identify combined patterns for informing government debt prevention and improving government service objectives, which show the flexibility and instantiation capability of combined mining in discovering informative knowledge in complex data.
Fu, X, Wang, Q, Chen, J, Huang, X, Chen, X, Cao, L, Tan, H, Li, W, Zhang, L, Bi, J, Su, Q & Chen, L 2011, 'Clinical significance of miR-221 and its inverse correlation with p27 Kip1 in hepatocellular carcinoma', Molecular Biology Reports, vol. 38, no. 5, pp. 3029-3035.
The aim of the present study is to explore the possible role of miR-221 in the pathogenesis of HCC. Matched HCC and adjacent non-cancerous samples were assayed for the expression of miR-221 and three G1/S transition inhibitors, p27Kip1, p21WAF1/Cip1 and TGF-β1, by in situ hybridization and immunohistochemistry, respectively. p27Kip1 is one of miR-221's proven targets. Real-time qRT-PCR was used to investigate miR-221 and p27Kip1 transcripts in different clinical stages. Western blotting was used to analyze the expression levels of p27Kip1 protein in different clinical stages. As a result, miR-221 and TGF-β1 are frequently up-regulated in HCC, while p27Kip1 and p21WAF1/Cip1 proteins are frequently down-regulated. Moreover, the expression of miR-221 and p27Kip1 correlated with metastasis, and miR-221's expression also correlated with tumor size. The expression of both p21WAF1/Cip1 and TGF-β1 correlated with tumor differentiation. miR-221's upregulation and p27Kip1's downregulation were significantly associated with tumor stage and metastasis. In conclusion, miR-221 is important in the tumorigenesis of HCC, possibly by specifically down-regulating p27Kip1, a cell-cycle inhibitor. These results indicate miR-221 as a new therapeutic target in HCC.
Yang, T, Kecman, V, Cao, L, Zhang, C & Huang, J 2011, 'Margin-Based Ensemble Classifier For Protein Fold Recognition', Expert Systems with Applications, vol. 38, no. 10, pp. 12348-12355.
Recognition of protein folding patterns is an important step in protein structure and function predictions. Traditional sequence similarity-based approaches fail to yield convincing predictions when proteins have low sequence identities, while the taxonom
Traditional data mining research mainly focuses on developing, demonstrating, and pushing the use of specific algorithms and models. The process of data mining stops at pattern identification. Consequently, a widely seen fact is that 1) many algorithms have been designed, of which very few are repeatable and executable in the real world; 2) often many patterns are mined, but a major proportion of them are either commonsense or of no particular interest to business; and 3) end users generally cannot easily understand and take them over for business use. In summary, we see that the findings are not actionable, and lack soft power in solving real-world complex problems. Thorough efforts are essential for promoting the actionability of knowledge discovery in real-world smart decision making. To this end, domain-driven data mining (D3M) has been proposed to tackle the above issues, and promote the paradigm shift from "data-centered knowledge discovery" to "domain-driven, actionable knowledge delivery." In D3M, ubiquitous intelligence is incorporated into the mining process and models, and a corresponding problem-solving system is formed as the space for knowledge discovery and delivery. Based on our related work, this paper presents an overview of the driving forces, theoretical frameworks, architectures, techniques, case studies, and open issues of D3M. D3M discloses many critical issues for which no thorough and mature solutions are yet available, which indicates both the challenges and the prospects of this new topic.
The in-depth analysis of human behavior has been increasingly recognized as a crucial means for disclosing interior driving forces, causes and impact on businesses in handling many challenging issues such as behavior modeling and analysis in virtual organizations, web community analysis, counter-terrorism and stopping crime. The modeling and analysis of behaviors in virtual organizations is an open area. Traditional behavior modeling mainly relies on qualitative methods from behavioral science and social science perspectives. On the other hand, so-called behavior analysis is actually based on human demographic and business usage data, such as churn prediction in the telecommunication industry, in which behavior-oriented elements are hidden in routinely collected transactional data. As a result, it is ineffective or even impossible to deeply scrutinize native behavior intention, lifecycle and impact on complex problems and business issues. In this paper, we propose the approach of behavior informatics (BI), in order to support explicit and quantitative behavior involvement through a conversion from source data to behavioral data, and further conduct genuine analysis of behavior patterns and impacts. BI consists of key components including behavior representation, behavioral data construction, behavior impact analysis, behavior pattern analysis, behavior simulation, and behavior presentation and behavior use. We discuss the concepts of behavior and an abstract behavioral model, as well as the research tasks, process and theoretical underpinnings of BI. Two real-world case studies are demonstrated to illustrate the use of BI in dealing with complex enterprise problems, namely analyzing exceptional market microstructure behavior for market surveillance and mining for high impact behavior patterns in social security data for governmental debt prevention.
Cao, L, Zhao, Y, Zhang, H, Luo, D, Zhang, C & Park, E 2010, 'Flexible Frameworks For Actionable Knowledge Discovery', IEEE Transactions On Knowledge And Data Engineering, vol. 22, no. 9, pp. 1299-1312.
Most data mining algorithms and tools stop at the mining and delivery of patterns satisfying expected technical interestingness. There are often many patterns mined but business people either are not interested in them or do not know what follow-up actio
Xiao, Y, Liu, B, Luo, D, Cao, L, Deng, F & Hao, Z 2010, 'Multi-agent system for customer relationship management with SVMs tool', International Journal of Intelligent Information and Database Systems, vol. 4, no. 2, pp. 121-136.
In this paper, we introduce multiple agents, knowledge discovery and data mining into customer relationship management (CRM) to set up the architecture of a multi-agent-based CRM system (MAB-CRM), and then use the SVMs-based approach to build up the decision support model which can classify the patterns obtained by the multiple agents into several decision levels, so that managers can pursue different decision-making activities according to the decision level of a pattern. Substantial experiments in the two-dimensional space show how the SVMs-based approach works. The practical problem from one Chinese company has been resolved by the SVMs-based approach. The results illustrate that this approach has an effective ability to learn the decision rules from the assessors' experience.
The adaptive local hyperplane (ALH) algorithm is a recently proposed classifier which has been shown to perform better than many benchmark classifiers, including support vector machine (SVM), K-nearest neighbor (KNN), linear discriminant analysis (LDA), and K-local hyperplane distance nearest neighbor (HKNN) algorithms. Although the ALH algorithm is well formulated and performs well in practice, its scalability over very large data sets is limited due to the online distance computations associated with all training instances. In this paper, a novel algorithm, called ALH-Fast and obtained by combining the classification tree algorithm and the ALH, is proposed to reduce the computational load of the ALH algorithm. Experimental results on two large data sets show that the ALH-Fast algorithm is both much faster and more accurate than the ALH algorithm.
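The tree-plus-local-classifier structure can be sketched roughly as follows, with plain 1-NN standing in for the ALH rule. This is a simplified illustration of how a classification tree prunes distance computations to a per-leaf subset, not the paper's ALH-Fast algorithm; the dataset and depth are arbitrary choices for the example.

```python
# Sketch of the tree-partitioned local classifier idea behind ALH-Fast:
# a shallow tree routes each query to a leaf, and distances are computed
# only against the training points that fell into the same leaf.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)
leaves = tree.apply(X_tr)                  # leaf id of each training point

def predict(x):
    leaf = tree.apply(x.reshape(1, -1))[0]
    mask = leaves == leaf                  # local training subset only
    if len(np.unique(y_tr[mask])) == 1:    # pure leaf: no search needed
        return y_tr[mask][0]
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr[mask], y_tr[mask])
    return knn.predict(x.reshape(1, -1))[0]

acc = np.mean([predict(x) == t for x, t in zip(X_te, y_te)])
print(acc)
```

Each prediction touches only one leaf's worth of training points, which is the source of the speed-up over computing distances to every training instance.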
Trading agents are useful for developing and back-testing quality trading strategies to support smart trading actions in the market. However, most existing trading agent research oversimplifies trading strategies and focuses on simulated ones. As a result, there is a big gap between the deliverables and business needs when the developed strategies are deployed in real life, and the actionable capability of the developed trading agents is often very limited. This paper, for the first time, introduces effective approaches for optimizing and integrating multiple classes of strategies through trading agent collaboration. An integration and optimization approach is proposed to identify the optimal trading strategy in each category, and further integrate the optimal strategies across classes. Positions associated with these optimal strategies are recommended for trading agents to take actions in the market. Extensive experiments on a large quantity of real-life market data show that trading agents following the recommended strategies have great potential to obtain high benefits at low cost. This verifies that it is promising to develop trading agents toward workable strategies that satisfy business needs.
Cao, L & Yu, P 2009, 'Behavior Informatics: An Informatics Perspective for Behavior Studies', IEEE Computational Intelligence Bulletin, vol. 10, no. 1, pp. 6-11.
Behavior is increasingly recognized as a key entity in business intelligence and problem-solving. Although behavior analysis has been extensively investigated in the social and behavioral sciences, where qualitative and psychological methods have been the main means, it is timely to investigate behavior from the informatics perspective in order to conduct formal representation and deep quantitative analysis. This article highlights the basic framework of behavior informatics, which aims to supply methodologies, approaches, means and tools for formal behavior modeling and representation, behavioral data construction, behavior impact modeling, behavior network analysis, behavior pattern analysis, and behavior presentation, management and use. Behavior informatics can greatly complement existing studies by providing more formal, quantitative and computable mechanisms and tools for deep behavior understanding and use.
Cao, L, Dai, RW & Zhou, M 2009, 'Metasynthesis: M-Space, M-Interaction, and M-Computing for Open Complex Giant Systems', IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 39, no. 5, pp. 1007-1021.
The studies of complex systems have been recognized as one of the greatest challenges for current and future science and technology. Open complex giant systems (OCGSs) are a family of especially complex systems with system complexities such as openness, human involvement, societal characteristics, and intelligence emergence. They greatly challenge multiple disciplines such as system sciences, system engineering, cognitive sciences, information systems, artificial intelligence, and computer sciences. As a result, traditional problem-solving methodologies can help deal with them but are far from a mature solution methodology. The theory of qualitative-to-quantitative metasynthesis has been proposed as a breakthrough and effective methodology for the understanding and problem solving of OCGSs. In this paper, we propose the concepts of M-Interaction, M-Space, and M-Computing, which are three key components for studying OCGSs and building problem-solving systems. M-Interaction forms the main problem-solving mechanism of qualitative-to-quantitative metasynthesis; M-Space is the OCGS problem-solving system embedded with M-Interactions, while M-Computing consists of engineering approaches to the analysis, design, and implementation of M-Space and M-Interaction. We discuss the theoretical framework, problem-solving process, social cognitive evolution, intelligence emergence, and pitfalls of certain types of cognitions in developing M-Space and M-Interaction from the perspectives of cognitive sciences and social cognitive interaction. These can help one understand complex systems and develop effective problem-solving methodologies.
Autonomous agents and multiagent systems (or agents) and data mining and knowledge discovery (or data mining) are two of the most active areas in information technology. Ongoing research has revealed a number of intrinsic challenges and problems facing each area, which can't be addressed solely within the confines of the respective discipline. A profound insight of bringing these two communities together has unveiled a tremendous potential for new opportunities and wider applications through the synergy of agents and data mining. With increasing interest in this synergy, agent mining is emerging as a new research field studying the interaction and integration of agents and data mining. In this paper, we give an overall perspective of the driving forces, theoretical underpinnings, main research issues, and application domains of this field, while addressing the state-of-the-art of agent mining research and development. Our review is divided into three key research topics: agent-driven data mining, data mining-driven agents, and joint issues in the synergy of agents and data mining. This new and promising field exhibits a great potential for groundbreaking work from foundational, technological and practical perspectives.
On top of two active research streams, agents and data mining, a most recent and exciting trend is their interaction and integration. Agent mining has emerged as a very promising field due to its unique contributions to complementary and innovative methodologies, techniques, and applications for complex problem-solving. This editorial summarizes the structure of this special issue.
Cao, LQ, Wang, XL, Wang, Q, Xue, P, Jiao, XY, Peng, HP, Lu, HW, Zheng, Q, Chen, XL, Huang, XH, Fu, XH & Chen, JS 2009, 'Rosiglitazone sensitizes hepatocellular carcinoma cell lines to 5-fluorouracil antitumor activity through activation of the PPARγ signaling pathway', Acta Pharmacologica Sinica, vol. 30, no. 9, pp. 1316-1322.
Aim: Resistance to 5-fluorouracil (5-FU) is a major cause of chemotherapy failure in advanced hepatocellular carcinoma (HCC). Rosiglitazone, a peroxisome proliferator-activated receptor γ (PPARγ) agonist, has a crucial role in growth inhibition and induction of apoptosis in several carcinoma cell lines. In this study, we examine rosiglitazone-induced sensitization of HCC cell lines (BEL-7402 and Huh-7 cells) to 5-FU. Methods: The 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide (MTT) assay was used to evaluate cell viability. Western blotting analysis was performed to detect protein expression (PPARγ, PTEN, and COX-2) in BEL-7402 cells. Immunohistochemistry staining was used to examine the expression of PTEN in 100 advanced HCC tissues and paracancerous tissues. In addition, small interfering RNA was used to suppress PPARγ, PTEN, and COX-2 expression. Results: Rosiglitazone facilitates the anti-tumor effect of 5-FU in HCC cell lines, which is mediated by the PPARγ signaling pathway. Activation of PPARγ by rosiglitazone increases PTEN expression and decreases COX-2 expression. Since the distribution of PTEN in HCC tissues is significantly decreased compared with the paracancerous tissue, over-expression of PTEN by rosiglitazone enhances 5-FU-inhibited cell growth of HCC. Moreover, down-regulation of COX-2 is implicated in the synergistic effect of 5-FU. Conclusion: Rosiglitazone sensitizes hepatocellular carcinoma cell lines to 5-FU antitumor activity through the activation of PPARγ. The results suggest potential novel therapies for the treatment of advanced liver cancer. © 2009 CPS and SIMM.
Chen, JS, Wang, Q, Fu, XH, Huang, XH, Chen, XL, Cao, LQ, Chen, LZ, Tan, HX, Li, W, Bi, J & Zhang, LJ 2009, 'Involvement of PI3K/PTEN/AKT/mTOR pathway in invasion and metastasis in hepatocellular carcinoma: Association with MMP-9', Hepatology Research, vol. 39, no. 2, pp. 177-186.
Aim: To investigate the status of Phosphatidylinositol 3-kinase (PI3K)/PTEN/AKT/mammalian target of rapamycin (mTOR) pathway and its correlation with clinicopathological features and matrix metalloproteinase-2, -9 (MMP-2, 9) in human hepatocellular carcinoma (HCC). Methods: PTEN, Phosphorylated AKT (p-AKT), Phosphorylated mTOR (p-mTOR), MMP-2, MMP-9 and Ki-67 expression levels were evaluated by immunohistochemistry on tissue microarrays containing 200 HCCs with paired adjacent non-cancerous liver tissues. PTEN, MMP-2 and MMP-9 mRNA levels were determined by real-time RT-PCR in 36 HCCs. The relationships between PI3K/PTEN/AKT/mTOR pathway and clinicopathological factors and MMP-2, 9 were analyzed in HCC. Results: In HCC, PTEN loss and overexpression of p-AKT and p-mTOR were associated with tumor grade, intrahepatic metastasis, vascular invasion, TNM stage and high Ki-67 labeling index (P < 0.05). PTEN loss was correlated with p-AKT, p-mTOR and MMP-9 overexpression. Furthermore, PTEN and MMP-2, 9 mRNA levels were down-regulated and up-regulated in HCC compared with paired non-cancerous liver tissues, respectively (P < 0.01). PTEN, MMP-2 and MMP-9 mRNA levels were correlated with tumor stage and metastasis. There was an inverse correlation between PTEN and MMP-9 mRNA expression. However, PI3K/PTEN/AKT/mTOR pathway was not correlated with MMP-2. Conclusions: PI3K/PTEN/AKT/mTOR pathway, which is activated in HCC, is involved in invasion and metastasis through up-regulating MMP-9 in HCC. © 2009 The Japan Society of Hepatology.
Huang, XH, Wang, Q, Chen, JS, Fu, XH, Chen, XL, Chen, LZ, Li, W, Bi, J, Zhang, LJ, Fu, Q, Zeng, WT, Cao, LQ, Tan, HX & Su, Q 2009, 'Bead-based microarray analysis of microRNA expression in hepatocellular carcinoma: MiR-338 is downregulated', Hepatology Research, vol. 39, no. 8, pp. 786-794.
Aim: Recent studies have underlined causative links between microRNA (miRNA) deregulation and cancer development. However, the relevance of abnormally expressed miRNA to tumor biology has not been well understood in hepatocellular carcinoma (HCC). Methods: A bead-based miRNA expression profiling method was performed on 20 pairs of surgically removed HCC and adjacent non-tumorous tissue (NT). The downregulation of miR-338 and its association with clinical characteristics were validated in an extended sample set of 36 paired HCC and adjacent non-tumorous liver tissues by real-time reverse transcription polymerase chain reaction (RT-PCR) analysis. Results: From our bead-based microarray data, 12 upregulated and 19 downregulated miRNA were found to be associated with HCC. Further characterization of miR-338, in which the 20 pairs of samples clustered clearly into two groups according to expression of miR-338, revealed that the level of miR-338 expression is associated with clinical aggressiveness, such as tumor size, tumor-node-metastasis stage, vascular invasion and intrahepatic metastasis. These results were validated by real-time RT-PCR analysis. Conclusion: Our study suggests that miRNA expression could have relevance to the clinical behavior of HCC and that the bead-based miRNA expression profiling method might be a suitable system to assay miRNA expression in large-scale diagnostic trials. © 2009 The Japan Society of Hepatology.
Tran, TP, Cao, L, Tran, D & Nguyen, CD 2009, 'Novel Intrusion Detection using Probabilistic Neural Network and Adaptive Boosting', International Journal of Computer Science and Information Security, vol. 6, no. 1, pp. 83-91.
This article applies machine learning techniques to intrusion detection problems within computer networks. Due to the complex and dynamic nature of computer networks and hacking techniques, detecting malicious activities remains a challenging task for security experts; currently available defense systems suffer from low detection capability and a high number of false alarms. To overcome these performance limitations, we propose a novel machine learning algorithm, namely Boosted Subspace Probabilistic Neural Network (BSPNN), which integrates an adaptive boosting technique and a semi-parametric neural network to obtain a good trade-off between accuracy and generality. As a result, learning bias and generalization variance can be significantly minimized. Substantial experiments on the KDD'99 intrusion benchmark indicate that our model outperforms other state-of-the-art learning algorithms, with significantly improved detection accuracy, minimal false alarms and relatively small computational complexity.
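The boosting-plus-PNN combination described in this abstract can be illustrated with a minimal sketch: a weighted Parzen-window classifier (a PNN-style kernel density scorer) serves as the weak learner inside a standard AdaBoost loop. This is our own stand-in under stated assumptions, not the paper's BSPNN; the function names, the kernel width `sigma` and the toy data are all hypothetical.

```python
import numpy as np

def pnn_predict(X_train, y_train, w, X, sigma=0.5):
    """Weighted Parzen-window (PNN-style) prediction: for each class,
    sum sample-weighted Gaussian kernels centred on that class's points."""
    preds = []
    for x in X:
        d2 = np.sum((X_train - x) ** 2, axis=1)
        k = w * np.exp(-d2 / (2 * sigma ** 2))
        preds.append(1 if k[y_train == 1].sum() >= k[y_train == -1].sum() else -1)
    return np.array(preds)

def adaboost_pnn(X, y, rounds=5, sigma=0.5):
    """AdaBoost loop: upweight the samples the weak PNN misclassifies."""
    n = len(y)
    w = np.ones(n) / n
    ensemble = []
    for _ in range(rounds):
        pred = pnn_predict(X, y, w, X, sigma)
        err = min(max(w[pred != y].sum(), 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weak-learner weight
        ensemble.append((w.copy(), alpha))
        w *= np.exp(-alpha * y * pred)         # reweight samples
        w /= w.sum()

    def predict(X_new):
        agg = np.zeros(len(X_new))
        for w_t, alpha in ensemble:
            agg += alpha * pnn_predict(X, y, w_t, X_new, sigma)
        return np.sign(agg)
    return predict
```

On a tiny separable dataset the boosted ensemble reproduces the labels, which is all this sketch is meant to show.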
Tsai, PC, Cao, L, Hintz, TB & Jan, T 2009, 'A bi-modal face recognition framework integrating facial expression with facial appearance', Pattern Recognition Letters, vol. 30, no. 12, pp. 1096-1109.
Among many biometric characteristics, the facial biometric is considered to be the least intrusive technology that can be deployed in real-world visual surveillance environments. However, little research attention has been paid to facial expression changes in facial biometrics. In fact, facial expression changes have often been treated as noise that degrades recognition performance. This paper studies an innovative viewpoint: (1) can facial expression changes, namely facial behavior, be positively used for face recognition? (2) furthermore, can facial behavior be integrated with facial appearance to assist extra-personal separation and enhance face recognition performance? We propose a bi-modal face recognition framework which integrates facial expression with facial appearance. Substantial experiments on multiple facial appearance and facial expression datasets have been conducted. Our experimental results validate that facial behavior can play a positive role in face recognition and can assist facial appearance in extra-personal separation across multiple modalities for improved personal identification.
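The bi-modal integration idea can be shown with a toy score-level fusion sketch, assuming each gallery subject already carries an appearance similarity and a behaviour similarity for the probe in [0, 1]. The weighted-sum rule and the weight `w` are our illustrative assumptions, not the paper's actual integration scheme.

```python
def fuse(appearance, behaviour, w=0.7):
    """Weighted-sum score-level fusion of the two modalities
    (w is a hypothetical weight, not taken from the paper)."""
    return w * appearance + (1 - w) * behaviour

def identify(gallery, w=0.7):
    """gallery maps subject -> (appearance_similarity, behaviour_similarity)
    for the probe; return the subject with the best fused score."""
    return max(gallery, key=lambda name: fuse(*gallery[name], w))
```

With appearance dominant (w=0.7), a subject with a strong appearance match wins even against a stronger behaviour-only match, which is the kind of trade-off the fusion weight controls.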
Zhang, H, Zhao, Y, Cao, L, Zhang, C & Bohlscheid, H 2009, 'Customer Activity Sequence Classification for Debt Prevention in Social Security', Journal Of Computer Science And Technology, vol. 24, no. 6, pp. 1000-1009.
From a data mining perspective, sequence classification builds a classifier using frequent sequential patterns. However, mining a complete set of sequential patterns on a large dataset can be extremely time-consuming, and the large number of patterns discovered also makes pattern selection and classifier building costly. In fact, for sequence classification it is much more important to discover discriminative patterns than a complete pattern set. In this paper, we propose a novel hierarchical algorithm to build sequential classifiers using discriminative sequential patterns. Firstly, we mine for the sequential patterns which are most strongly correlated with each target class. In this step, an aggressive strategy is employed to select a small set of sequential patterns. Secondly, pattern pruning and a serial coverage test are applied to the mined patterns. The patterns that pass the serial test are used to build the sub-classifier at the first level of the final classifier. Thirdly, the training samples that cannot be covered are fed back to the sequential pattern mining stage with updated parameters. This process continues until predefined interestingness measure thresholds are reached, or all samples are covered. The patterns generated in each loop form the sub-classifier at each level of the final classifier. Within this framework, the search space can be reduced dramatically while good classification performance is achieved. The proposed algorithm is tested in a real-world business application for debt prevention in the social security area. The novel sequence classification algorithm shows effectiveness and efficiency for predicting debt occurrences based on customer activity sequence data.
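The first two steps of such a hierarchy — aggressively selecting class-correlated sequential patterns, then running a serial coverage test whose uncovered samples would feed the next level — can be sketched as follows. This is a simplified illustration only (patterns up to length 2, confidence as the correlation measure); the names and thresholds are ours, not the paper's.

```python
def is_subseq(pat, seq):
    """True if pat occurs in seq as a (non-contiguous) subsequence."""
    it = iter(seq)
    return all(item in it for item in pat)

def mine_discriminative(seqs, labels, target, min_conf=0.7):
    """One level of the hierarchy: keep short patterns whose confidence
    for the target class exceeds min_conf (an aggressive selection)."""
    items = sorted({x for s in seqs for x in s})
    candidates = [(i,) for i in items] + [(a, b) for a in items for b in items]
    selected = []
    for pat in candidates:
        cover = [is_subseq(pat, s) for s in seqs]
        total = sum(cover)
        if total == 0:
            continue
        hits = sum(1 for c, l in zip(cover, labels) if c and l == target)
        if hits / total >= min_conf:
            selected.append((pat, hits / total))
    return selected

def serial_coverage(seqs, patterns, min_cover=1):
    """Serial coverage test: a sample is covered once min_cover selected
    patterns match it; uncovered samples go back for the next level."""
    return [s for s in seqs
            if sum(1 for pat, _ in patterns if is_subseq(pat, s)) < min_cover]
```

In the full algorithm the uncovered samples returned by `serial_coverage` would be re-mined with relaxed parameters to build the next sub-classifier level.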
Many real-life applications involve multiple sequences that are coupled with each other. It is unreasonable either to study the coupled sequences separately or to simply merge them into one sequence, because the information about their interacting relationships would be lost. Furthermore, such coupled sequences frequently undergo significant changes, which are likely to degrade the performance of a trained model. Taking the detection of abnormal trading activity patterns in stock markets as an example, this paper proposes a Hidden Markov Model-based approach to address these two issues. Our approach is suitable for sequence analysis on multiple coupled sequences and can adapt to significant sequence changes automatically. Substantial experiments conducted on a real dataset show that our approach is effective.
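A minimal sketch of the HMM side of such an approach: two coupled sequences are encoded as one joint-symbol sequence so their interaction is preserved (rather than lost by naive merging), and the standard forward algorithm scores how well a segment fits a trained model — a low log-likelihood would flag abnormal activity. The joint encoding and all parameter values below are illustrative assumptions, not the paper's model.

```python
import numpy as np

def couple(seq1, seq2, n_symbols2):
    """Encode two coupled discrete sequences as one joint-symbol sequence,
    preserving their interaction at each time step."""
    return [a * n_symbols2 + b for a, b in zip(seq1, seq2)]

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM
    (standard forward algorithm with per-step scaling).
    pi: initial state probs; A: transition matrix; B: emission matrix."""
    alpha = pi * B[:, obs[0]]
    logp = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        logp += np.log(alpha.sum())
        alpha /= alpha.sum()
    return logp
```

A sequence consistent with the model's persistent regimes scores higher than one that keeps flipping symbols, which is the contrast an anomaly threshold would exploit.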
Cao, L 2008, 'Integrating Agent, Service and Organizational Computing', International Journal of Software Engineering and Knowledge Engineering, vol. 18, no. 5, pp. 573-596.
Engineering open complex systems is challenging because of system complexities such as openness, the involvement of organizational factors and service delivery. These cannot be handled well by the single use of existing computing techniques such as agent-based computing or service-oriented computing. Due to the intrinsic organizational characteristics and the requirement of service delivery, an integrative computing paradigm combining agent, service, organizational and social computing can engineer open complex systems more effectively. In this paper, we briefly introduce an integrative computing approach named OASOC for system analysis and design. It combines and complements the strengths of agent, service and organizational computing to handle the complexities of open complex systems. OASOC provides facilities for organization-oriented analysis and agent service-oriented design. It also supports the transition between analysis and design. Compared with existing approaches, our approach can (1) support service and organization aspects that are either rarely or only weakly covered by single computing methods, (2) provide effective mechanisms to integrate agent, service and organizational computing, and (3) complement the strengths of the various methods. Experience in engineering an online trading support system has further shown the workable capability of integrating agent, service and organizational computing for engineering open complex systems.
Cao, L & Nguyen, NT 2008, 'Intelligence Metasynthesis and Knowledge Processing in Intelligent Systems', Journal of Universal Computer Science, vol. 14, no. 14, pp. 2256-2262.
Intelligence and knowledge play increasingly important roles in building complex intelligent systems, for instance, intrusion detection systems and operational analysis systems. Knowledge processing in complex intelligent systems faces new challenges from the growing number of applications and environments, such as the requirements of representing domain and human knowledge in intelligent systems, and discovering actionable knowledge on a large scale in distributed web applications. In this paper, we discuss the main challenges of, and promising approaches to, intelligence metasynthesis and knowledge processing in open complex intelligent systems (OCIS). We believe (1) ubiquitous intelligence, including data intelligence, domain intelligence, human intelligence, network intelligence and social intelligence, is necessary for OCIS and needs to be meta-synthesized; and (2) knowledge processing should pay more attention to developing innovative and workable methodologies, techniques, tools and systems for representing, modelling, transforming, discovering and servicing the uncertain, large-scale, deep, distributed, domain-oriented, human-involved, and actionable knowledge highly expected in constructing open complex intelligent systems. To this end, the meta-synthesis of ubiquitous intelligence is an appropriate way of designing complex intelligent systems. To support intelligence meta-synthesis, m-interaction can serve as the working mechanism to form m-spaces as problem-solving systems. In building such m-spaces, advances in knowledge processing are necessary.
Cao, L & Nguyen, NT 2008, 'Knowledge processing in intelligent systems J.UCS special issue', Journal of Universal Computer Science, vol. 14, no. 14, p. 2255.
Cao, L, Zhang, C & Zhou, M 2008, 'Engineering open complex agent systems: A case study', IEEE Transactions On Systems Man And Cybernetics Part C-Applications And Reviews, vol. 38, no. 4, pp. 483-496.
Open complex agent systems (OCAS) are becoming increasingly important in constructing problem-solving systems for enterprise applications. They are challenging to engineer because they present very high system complexities involving human users and interactions with a ch…
Cao, L, Zhao, Y & Zhang, C 2008, 'Mining impact-targeted activity patterns in imbalanced data', IEEE Transactions On Knowledge And Data Engineering, vol. 20, no. 8, pp. 1053-1066.
Impact-targeted activities are rare, but they may have a significant impact on society. For example, isolated terrorism activities may lead to a disastrous event, threatening national security. Similar issues can also be seen in many other areas.
Cao, L, Zhao, Y, Zhang, C & Zhang, H 2008, 'Activity mining: From activities to actions', International Journal Of Information Technology & Decision Making, vol. 7, no. 2, pp. 259-273.
Activity data accumulated in real life, such as terrorist activities and governmental customer contacts, present special structural and semantic complexities. Activity data may lead to or be associated with significant business impacts, and result in important actions and decision making leading to business advantage. For instance, a series of terrorist activities may trigger a disaster to society, and large amounts of fraudulent activities in social security programs may result in huge government customer debt. Uncovering these activities or activity sequences can greatly inform and/or enhance corresponding actions in business decision making. However, mining such data challenges existing KDD research in aspects such as unbalanced data distribution and impact-targeted pattern mining. This paper investigates the characteristics and challenges of activity data, and the methodologies and tasks of activity mining, based on case-study experience in the area of social security. Activity mining aims to discover high-impact activity patterns in huge volumes of unbalanced activity transactions. The activity patterns identified can be used to prevent disastrous events or improve business decision making and processes. We illustrate the above issues and prospects in mining governmental customer contact data to recover customer debt.
Chen, W, Cao, L & Qin, Z 2008, 'An integrated investment decision-support framework analysing and synthesising multidimensional market dynamics', International Journal of Intelligent Systems Technologies and Applications, vol. 4, no. 3-4, pp. 239-253.
In stock markets, the performance of traditional technology-based investment methods is limited because such methods only take into account single-dimensional market dynamics. The paper shows how the integration of multi-dimensional dynamics can improve performance. We propose a novel three-layer integrated framework composed of Analysis, Synthesis, and Investment Decision Support. At the first layer, multi-dimensional market dynamics are identified, in which we emphasize two key aspects that previous studies have neglected: unique trends of stocks, and a two-way reflexivity relationship between investors' decisions and market reactions. At the second layer, multi-dimensional dynamics are synthesized to reflect real and potential market situations. At the third layer, a prototype integrates the functions of the first two layers for investment decision support. The framework covers multi-dimensional dynamics, and incorporates the concepts and advantages of traditional investment methods. The framework is promising, and our experimental results indicated that it outperformed market baselines and single-dimensional conventional methods. © 2008 Inderscience Enterprises Ltd.
Chen, XL, Cao, LQ, She, MR, Wang, Q, Huang, XH & Fu, XH 2008, 'Gli-1 siRNA induced apoptosis in Huh7 cells', World Journal of Gastroenterology, vol. 14, no. 4, pp. 582-589.
Aim: To investigate the effects of Gli-1 small interfering RNA (siRNA) on Huh7 cells, and the change of Bcl-2 expression in Huh7 cells. Methods: Human hepatocellular carcinoma Huh7 cells were used. Cell viability was analyzed by 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide (MTT) assay. The expression of Gli-1 and Bcl-2 family members was detected by RT-PCR and Western blot. Apoptosis was detected by flow cytometry using propidium iodide, measured by Hoechst 33258 staining using advanced fluorescence microscopy, and assessed by caspase-3 enzymatic assay. Cell growth was analyzed after treatment with Gli-1 siRNA and 5-fluorouracil (5-Fu). Results: Inhibition of Gli-1 mRNA in Huh7 cells through Gli-1 siRNA reduced cell viability. Gli-1 siRNA treatment also induced apoptosis by three criteria: an increase in the sub-G1 cell cycle fraction; nuclear condensation, a morphologic change typical of apoptosis; and activation of caspase-3. Gli-1 siRNA was also able to down-regulate Bcl-2. However, Gli-1 siRNA resulted in no significant changes in Bcl-xl, Bax, Bad, and Bid. Furthermore, Gli-1 siRNA increased the cytotoxic effect of 5-Fu on Huh7 cells. Conclusion: Down-regulation of Bcl-2 plays an important role in apoptosis induced by Gli-1 siRNA in HCC cells. Combining Gli-1 siRNA with a chemotherapeutic drug could represent a more promising strategy against HCC. The effects of these strategies need further investigation in vivo and may have potential clinical application. © 2008 WJG. All rights reserved.
Chen, XL, Wang, Q, Cao, LQ, Huang, XH, Fu, XH, Tan, HX, Zhen, MC & Chen, JS 2008, 'Epigallocatechin-3-gallate induces apoptosis in human hepatocellular carcinoma cells', National Medical Journal of China, vol. 88, no. 36, pp. 2524-2528.
Objective: To investigate the effects of epigallocatechin-3-gallate (EGCG) on human hepatocellular carcinoma (HCC) cells and the mechanism thereof. Method: Human HCC cells of the lines HepG2 and SMMC-7721 were cultured and treated with EGCG at concentrations of 6.25, 12.5, 25, 50, 100, 200, and 400 μg/ml respectively for 24 h and 48 h. Cell viability was assessed by 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide (MTT) assay. Trypan blue staining was used to count the cells. Flow cytometry was conducted to detect cell apoptosis. The protein levels of Bcl-2, an anti-apoptosis factor, and cyclooxygenase-2 (COX-2), an up-regulator of Bcl-2, were measured. The activities of caspase-9 and caspase-3, which promote the apoptosis of HCC cells, were measured using a colorimetric method. RT-PCR was used to detect the mRNA expression of COX-2 and the Bcl-2 family. Results: The viabilities of the HepG2 cells treated with EGCG at concentrations of 50-400 μg/ml for 48 h were reduced to 93.8% ± 2.8%, 62.3% ± 5.4%, 33.9% ± 2.5%, and 17.6% ± 3.2% respectively, all significantly lower than that of the control group [(100.0% ± 2.8%), all P < 0.05]; and the viabilities of the SMMC-7721 cells treated with EGCG at concentrations of 50-400 μg/ml for 48 h were reduced to 49.6% ± 3.5%, 30.3% ± 3.8%, 17.7% ± 2.2%, and 13.0% ± 2.5% respectively, all significantly lower than that of the control group [(100.0% ± 0.8%), all P < 0.05]. After treatment with 100 μg/ml EGCG for 24 h, 48 h, 72 h, and 96 h, the live HepG2 cell numbers were (8.0 ± 1.5), (22.0 ± 3.1), (37.0 ± 5.4), and (61.0 ± 8.7) ×10⁴ respectively, all significantly lower than those of the control cells [(15.0 ± 2.5), (45.0 ± 5.3), (86.0 ± 11.0), and (210.0 ± 23.0) ×10⁴ respectively, all P < 0.05]; and the live SMMC-7721 cell numbers were (7.0 ± 2.2), (13.0 ± 2.5), (20.0 ± 3.7), and (31.0 ± 4.0) ×10⁴ respectively, all significantly lower than those of the control cells [(15.0 ± 2.5), (45.0 ± 5.3), (86.0 ± 11.0), and (...
Fu, X, Wang, Q, Chen, X, Huang, X, Cao, L, Tan, H, Li, W, Zhang, L, Bi, J, Su, Q & Chen, L 2008, 'Expression patterns and polymorphisms of PTCH in Chinese hepatocellular carcinoma patients', Experimental and Molecular Pathology, vol. 84, no. 3, pp. 195-199.
Aberrant activation of the Hedgehog signaling pathway leads to pathological consequences in a variety of human tumors. PTCH (PTCH1), the receptor of the Hedgehog pathway, is reported to function as a gatekeeper in tumor formation. Here we report that, by semi-quantitative RT-PCR, PTCH expression was found in 38 hepatocellular carcinoma (HCC) patients (66%). Evidence from real-time quantitative RT-PCR further indicates that, compared with matched nontumorous liver tissue, PTCH exhibits higher expression in well and moderately differentiated tumors, but lower expression in poorly differentiated tumors. Immunohistochemical staining showed PTCH protein was detected in the cytoplasm of 56.3% of HCC samples (9/16). For the first time, we investigated the polymorphisms of PTCH in HCC. First we sequenced the recognized mutation hot-spot regions of PTCH in 38 HCC samples. Two previously reported single nucleotide polymorphisms (SNPs) and a novel SNP, A1056G, were identified. We then examined these three SNPs in 171 HCC samples and 162 normal liver samples. However, statistical analysis showed that none of these SNPs had a statistically significant association with HCC. In conclusion, our data suggest that PTCH is involved in early-stage tumor development and that the Hedgehog pathway in Chinese HCC is activated by ligand expression rather than by mutation. © 2008 Elsevier Inc. All rights reserved.
Lin, L & Cao, L 2008, 'Mining in-depth patterns in stock market', International Journal of Intelligent Systems Technologies and Applications, vol. 4, no. 3/4, pp. 225-238.
Stock trading plays an important role in supporting profitable stock investment. In particular, more and more data mining-based technical trading rules have been developed and used in stock trading systems to assist investors with smart trading decisions. However, many mined trading rules are of no interest to traders and brokers because they are discovered on the basis of statistical significance without checking traders' interestingness concerns. To address this, this paper proposes in-depth data mining technologies to overcome the disadvantages of current data mining methods. We implement a decision-support in-depth trading pattern discovery system with Robust Genetic Algorithms (RGA). The system integrates expert knowledge and domain constraints into trading rule development. We further utilise this technique to mine actionable stock-rule pairs targeting behaviour with high return at low risk. The proposed approaches are tested on real stock order book data with varying investment strategies.
Luo, D, Cao, L, Luo, C, Zhang, C & Wang, W 2008, 'Towards business interestingness in actionable knowledge discovery', Frontiers in Artificial Intelligence and Applications, vol. 177, no. 1, pp. 99-109.
From the perspective of the evolution of pattern interestingness, data mining has experienced two phases: Phase 1, research focused on technical objective interestingness, and Phase 2, studies focused on both technical objective and subjective interestingness. As a result of these efforts, mined patterns are of significant interest from a technical standpoint. However, technically interesting patterns are not necessarily of interest to business. In fact, real-world experience shows that many mined patterns, which are interesting from the perspective of the data mining method used, fall short of business expectations when they are delivered to the final user. This scenario involves a grand challenge in next-generation KDD (Knowledge Discovery in Databases) studies, defined as actionable knowledge discovery. To discover knowledge that can be used for taking actions to business advantage, this paper proposes a framework that extends the evolution of knowledge evaluation to Phase 3 and Phase 4. In Phase 3, concerns with objective interestingness from a business perspective are added on top of Phase 2, while in Phase 4 both technical and business interestingness should be satisfied in terms of objective and subjective perspectives. The introduction of Phase 4 provides a comprehensive knowledge actionability framework for actionable knowledge discovery. We illustrate applications in governmental data mining, showing that the considerations and adoption of the Phase 4 framework have the potential to enhance both interestingness and expectation. As a result, discovered knowledge has a better chance of supporting action-taking in the business world. © 2008 The authors and IOS Press. All rights reserved.
Ni, J, Cao, L & Zhang, C 2008, 'Evolutionary optimization of trading strategies', Frontiers in Artificial Intelligence and Applications, vol. 177, no. 1, pp. 11-24.
It is a non-trivial task to effectively and efficiently optimize trading strategies, not to mention doing so in real-world situations. This paper presents a general definition of this optimization problem and discusses the application of evolutionary technologies (genetic algorithms in particular) to the optimization of trading strategies. Experimental results show that this approach is promising. © 2008 The authors and IOS Press. All rights reserved.
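As an illustration of evolutionary optimization of a trading strategy, the sketch below uses a genetic-algorithm loop (truncation selection plus mutation) to tune the window lengths of a moving-average crossover rule on a price series. The strategy, fitness function and parameter ranges are our own toy assumptions, not those in the paper.

```python
import random

def ma(prices, n, t):
    """Simple moving average of the n prices ending at index t."""
    return sum(prices[t - n + 1:t + 1]) / n

def backtest(prices, short, long):
    """Fitness: profit of an MA-crossover rule that holds while
    the short MA is above the long MA."""
    profit, holding, entry = 0.0, False, 0.0
    for t in range(long, len(prices)):
        signal = ma(prices, short, t) > ma(prices, long, t)
        if signal and not holding:
            holding, entry = True, prices[t]
        elif not signal and holding:
            holding, profit = False, profit + prices[t] - entry
    if holding:
        profit += prices[-1] - entry
    return profit

def evolve(prices, pop_size=20, gens=15, seed=0):
    """GA loop over (short, long) windows: keep the fitter half,
    refill the population with mutated copies of the survivors."""
    rng = random.Random(seed)
    pop = [(rng.randint(2, 10), rng.randint(11, 40)) for _ in range(pop_size)]
    for _ in range(gens):
        scored = sorted(pop, key=lambda p: backtest(prices, *p), reverse=True)
        parents = scored[: pop_size // 2]
        pop = parents + [
            (min(10, max(2, p[0] + rng.randint(-2, 2))),
             min(40, max(11, p[1] + rng.randint(-4, 4))))
            for p in parents
        ]
    return max(pop, key=lambda p: backtest(prices, *p))
```

Clamping keeps the short window strictly below the long window, so every candidate in the population stays a valid crossover strategy.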
Xiao, Y, Liu, B & Cao, L 2008, 'A Chinese question classification using one-vs-one method as a learning tool', International Journal of Intelligent Information and Database Systems, vol. 2, no. 4, pp. 446-459.
Question classification plays an important role in question answering systems, and errors in question classification will probably result in the failure of question answering. Thus, how to enhance its accuracy is an open question. To enhance the accuracy of Chinese question classification, this paper extends the one-against-one method based on SVMs to this problem. The results show the good performance of the algorithm on Chinese question classification problems. © 2008, Inderscience Publishers.
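The one-against-one decomposition itself can be sketched independently of the base learner. In the toy version below, a nearest-centroid classifier stands in for the SVMs used in the paper; one binary classifier is trained per class pair and each pair contributes one vote at prediction time.

```python
from itertools import combinations
import numpy as np

class OneVsOne:
    """One-against-one decomposition with a nearest-centroid stand-in
    for the per-pair binary SVMs; classification is by majority vote."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.pairs_ = []
        for a, b in combinations(self.classes_, 2):
            ca = X[[i for i, yi in enumerate(y) if yi == a]].mean(axis=0)
            cb = X[[i for i, yi in enumerate(y) if yi == b]].mean(axis=0)
            self.pairs_.append((a, b, ca, cb))  # one "classifier" per pair
        return self

    def predict(self, X):
        out = []
        for x in X:
            votes = {c: 0 for c in self.classes_}
            for a, b, ca, cb in self.pairs_:
                win = a if np.linalg.norm(x - ca) <= np.linalg.norm(x - cb) else b
                votes[win] += 1
            out.append(max(votes, key=votes.get))
        return out
```

For k classes this trains k(k-1)/2 pairwise classifiers, which is exactly the structure the one-against-one SVM extension relies on.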
Market surveillance plays an important role in constructing market models. From a data analysis perspective, we view it as valuable both for smart trading, in designing legal and profitable trading strategies, and for smart regulation, in maintaining market integrity, transparency and fairness. Existing trading pattern analysis focuses only on interday data, which discloses explicit and high-level market dynamics. Meanwhile, the existing market surveillance systems available from large exchanges are facing crucial challenges of diversified, dynamic, distributed and cyber-based misuse, mis-disclosure and misdealing of information, announcements and orders in one market or across multiple markets. Therefore, there is a crucial need to develop innovative and workable methods for smart trading and surveillance. To deal with such issues, we propose the innovative concept of microstructure pattern analysis and corresponding approaches in this paper. Microstructure pattern analysis studies the trading behaviour patterns of traders in market microstructure data by utilizing market microstructure knowledge. The identified market microstructure patterns are then used to power market trading and surveillance agents for automatically detecting/designing profitable and legal trading strategies or monitoring abnormal market dynamics and traders' behaviour. Such agent-driven market trading/surveillance systems can greatly enhance the analytical, discovery and decision-support capability of market trading/surveillance compared with current predefined rule/alert-based systems.
Cao, L 2007, 'Domain-driven Data Mining: A Framework', IEEE Intelligent Systems, vol. 22, no. 4, pp. 78-79.
Data mining increasingly faces complex challenges from real-world business problems and needs. The gap between business expectations and R&D results in this area involves key aspects of the field, such as methodologies, targeted problems, pattern interestingness, and infrastructure support. Both researchers and practitioners are realizing the importance of domain knowledge in closing this gap and developing actionable knowledge for real user needs.
Cao, L & Zhang, C 2007, 'The Evolution of KDD: Towards Domain-Driven Data Mining', International Journal of Pattern Recognition and Artificial Intelligence, vol. 21, no. 4, pp. 677-692.
Traditionally, data mining is an autonomous, data-driven, trial-and-error process. Its typical task is to let the data tell a story disclosing hidden information, in which domain intelligence may not be necessary when the target is merely the demonstration of an algorithm. Often the knowledge discovered is not of general interest to business needs. By contrast, real-world applications rely on knowledge for taking effective actions. Reviewing the evolution of KDD, this paper briefly introduces domain-driven data mining to complement traditional KDD. Domain intelligence is highlighted for actionable knowledge discovery, involving aspects such as domain knowledge, people, environment and evaluation. We illustrate it through mining activity patterns in social security data.
Cao, L, Luo, D & Zhang, C 2007, 'Knowledge actionability: satisfying technical and business interestingness', International Journal of Business Intelligence and Data Mining, vol. 2, no. 4, pp. 496-514.
Traditionally, knowledge actionability has been investigated mainly by developing and improving technical interestingness. Recently, initial work on technical subjective interestingness and business-oriented profit mining has shown general potential, but it remains a long-term mission to bridge the gap between technical significance and business expectation. In this paper, we propose a two-way significance framework for measuring knowledge actionability, which highlights both technical interestingness and domain-specific expectations. We further develop a fuzzy interestingness aggregation mechanism to generate a ranked final pattern set balancing technical and business interests. Real-life data mining applications show the proposed knowledge actionability framework can complement technical interestingness while satisfying real user needs.
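One simple way to realise such an aggregation — purely our illustration, since the paper's exact fuzzy mechanism is not reproduced here — is to treat the technical and business scores as fuzzy membership degrees, combine them with the minimum t-norm, and rank the result:

```python
def fuzzy_and(a, b):
    """Fuzzy conjunction (minimum t-norm): a pattern is actionable only
    to the degree it satisfies BOTH technical and business interest."""
    return min(a, b)

def rank_patterns(patterns):
    """patterns: list of (name, tech_score, biz_score), scores in [0, 1].
    Aggregate each pair with the min t-norm and rank descending."""
    scored = [(name, fuzzy_and(tech, biz)) for name, tech, biz in patterns]
    return sorted(scored, key=lambda p: p[1], reverse=True)
```

Under the min t-norm, a pattern with high technical but low business interest (or vice versa) is ranked below one that is moderately strong on both, which captures the two-way balancing the framework calls for.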
Cao, LQ, Chen, XL, Wang, Q, Huang, XH, Zhen, MC, Zhang, LJ, Li, W & Bi, J 2007, 'Upregulation of PTEN involved in rosiglitazone-induced apoptosis in human hepatocellular carcinoma cells', Acta Pharmacologica Sinica, vol. 28, no. 6, pp. 879-887.
Aim: To investigate the effects of rosiglitazone, a peroxisome proliferator-activated receptor gamma (PPARγ) agonist, on the expression of the phosphatase and tensin homologue deleted on chromosome 10 gene (PTEN) and cell growth in hepatocellular carcinoma cells, as well as the underlying mechanisms of these effects. Methods: RT-PCR and Western blotting analyses were performed to detect transcription and the expression of PTEN in Hep3B cells treated with rosiglitazone. The 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide assay was used to evaluate cell growth. Flow cytometry, DNA fragmentation analysis, caspase enzymatic assay, and Hoechst 33258 staining were used to determine cell apoptosis. Furthermore, small interfering RNA was used to suppress PTEN expression. Results: Rosiglitazone increased the expression of PTEN in a dose- and time-dependent manner through the PPARγ-dependent signal transduction pathway. PTEN upregulation was concomitant with a decreased level of Akt phosphorylation, subsequently resulting in cell growth inhibition and apoptosis in Hep3B cells. PTEN knockdown dramatically blocked these effects of rosiglitazone. Moreover, the exposure of cells to rosiglitazone activated caspases-9 and -3 during the apoptotic process. Conclusion: Upregulation of PTEN is involved in the inhibition of cell growth and the induction of cell apoptosis by rosiglitazone, suggesting that rosiglitazone may be useful in liver cancer therapy via apoptosis. © 2007 CPS and SIMM.
Cao, LQ, Wang, Q, Chen, XL, Zhen, MC, Fu, XH & Huang, XH 2007, '15-deoxy-Δ12,14-prostaglandin J2 induces anoikis of hepatocellular carcinoma cells: An in vitro experiment', National Medical Journal of China, vol. 87, no. 42, pp. 3001-3005.
Objective: To investigate the effect of 15-deoxy-Δ12,14-prostaglandin J2 (15-d-PGJ2) on the anoikis of hepatocellular carcinoma (HC) cells and the mechanisms thereof. Methods: Fibronectin or polyhydroxyethylmethacrylate (poly-HEMA) was coated onto tissue culture plates; cell growth status and morphological changes were observed by optical microscopy. DNA fragmentation analysis and flow cytometry were used to measure apoptotic activity. Western blotting analysis was performed to detect the levels of focal adhesion kinase (FAK) and phosphorylated FAK (p-FAK). Furthermore, small interfering RNA (siRNA) was used to suppress FAK expression. Results: The adhesion rate of BEL-7402 cells treated with 15-d-PGJ2 began to decrease 12 h after treatment, in a time- and dose-dependent manner compared with the HC cell control group (all P < 0.05); at a 15-d-PGJ2 concentration of 20 μmol/L, the adherent cell ratios at 24 h and 48 h were (66.0 ± 3.6)% and (35.0 ± 5.0)%, respectively. Anoikis of BEL-7402 cells was observed by flow cytometry and DNA fragmentation analysis. Western blotting showed that the p-FAK level of BEL-7402 cells treated with 15-d-PGJ2 for 24 h decreased dose-dependently, whereas the total FAK protein did not change. Conclusion: 15-d-PGJ2 induces anoikis and decreases phosphorylated FAK expression in hepatocellular carcinoma cells.
In this paper, we propose a novel framework to deal with data imbalance in class association rule mining. In each class association rule, the right-hand side is a target class while the left-hand side may contain one or more attributes. This framework focuses on multiple imbalanced attributes on the left-hand side. In the proposed framework, the rules with and without imbalanced attributes are processed in parallel. The rules without imbalanced attributes are mined through a standard algorithm, while the rules with imbalanced attributes are mined based on newly defined measurements. Through a simple transformation, these measurements can be mapped into a uniform space so that only a few parameters need to be specified by the user. In the case study, the proposed algorithm is applied in the social security field. Although some attributes are severely imbalanced, the rules with a minority of the imbalanced attributes have been mined efficiently.
Zhen, MC, Wang, Q, Huang, XH, Cao, LQ, Chen, XL, Sun, K, Liu, YJ, Li, W & Zhang, LJ 2007, 'Green tea polyphenol epigallocatechin-3-gallate inhibits oxidative damage and preventive effects on carbon tetrachloride-induced hepatic fibrosis', Journal of Nutritional Biochemistry, vol. 18, no. 12, pp. 795-805.
The aim of the study was to examine the effects of epigallocatechin-3-gallate (EGCG) on hepatic fibrogenesis and on cultured hepatic stellate cells (HSCs). The rat model of carbon tetrachloride (CCl4)-induced hepatic fibrosis was used to assess the effect of daily intraperitoneal injections of EGCG on the indexes of fibrosis. Histological and hepatic hydroxyproline examination revealed that EGCG significantly arrested the progression of hepatic fibrosis. EGCG caused significant amelioration of liver injury (reduced activities of serum alanine aminotransferase and aspartate aminotransferase). The development of CCl4-induced hepatic fibrosis altered the redox state, with decreased hepatic glutathione and increased formation of lipid peroxidative products, both of which were partially normalized by treatment with EGCG. Moreover, EGCG markedly attenuated HSC activation as well as matrix metalloproteinase (MMP)-2 activity. In cultured stellate cells, the expression of MMP-2 mRNA and protein was substantially reduced by EGCG treatment. Concanavalin A-induced activation of secreted MMP-2 was inhibited by EGCG through its influence on membrane type 1-MMP activity. These results demonstrate that administration of EGCG may be useful in the treatment and prevention of hepatic fibrosis. © 2007 Elsevier Inc. All rights reserved.
Extant data mining is based on data-driven methodologies. It either views data mining as an autonomous, data-driven, trial-and-error process or analyzes business issues only in an isolated, case-by-case manner. As a result, the knowledge discovered is often not interesting to real business needs. Therefore, this article proposes a practical data mining methodology, referred to as domain-driven data mining, which targets actionable knowledge discovery in a constrained environment to satisfy user preference. Domain-driven data mining consists of a DDID-PD framework that considers key components such as constraint-based context, the integration of domain knowledge, human-machine cooperation, in-depth mining, actionability enhancement, and an iterative refinement process. We also illustrate examples of mining actionable correlations in the Australian Stock Exchange, which show that domain-driven data mining has the potential to further improve the actionability of patterns for practical use by industry and business.
The integration of Business Intelligence (BI) has been taken by business decision-makers as an effective means to enhance enterprise "soft power" and added value in the reconstruction and revolution of traditional industries. The existing solutions based on structural integration pack together data warehouse (DW), OLAP, data mining (DM) and reporting systems from different vendors. BI system users are finally delivered a reporting system in which reports, data models, dimensions and measures are predefined by system designers. According to a survey in the US, 85% of DW projects based on such solutions failed to meet their intended objectives. In this paper, we summarize our investigation of the integration of BI on the basis of semantic integration and structural interaction. Ontology-based integration of BI is discussed for semantic interoperability in integrating DW, OLAP and DM. A hybrid ontological structure is introduced, which includes a conceptual view, an analytical view and a physical view. These views are matched with user interfaces, the DW and enterprise information systems, respectively. Relevant ontological engineering techniques are developed for ontology namespaces, semantic relationships, and ontological transformation, mapping and query in this ontological space. The approach is promising for business-oriented, adaptive and automatic integration of BI in the real world. Operational decision-making experiments within a telecom company have demonstrated that a BI system utilizing the proposed approach is more flexible.
Zhen, MC, Huang, XH, Wang, Q, Sun, K, Liu, YJ, Li, W, Zhang, LJ, Cao, LQ & Chen, XL 2006, 'Green tea polyphenol epigallocatechin-3-gallate suppresses rat hepatic stellate cell invasion by inhibition of MMP-2 expression and its activation', Acta Pharmacologica Sinica, vol. 27, no. 12, pp. 1600-1607.
Aim: Epigallocatechin-3-gallate (EGCG) is the major component of green tea polyphenols, whose wide range of biological properties includes anti-fibrogenic activity. Matrix metalloproteinases (MMP) that participate in extracellular matrix degradation are involved in the development of hepatic fibrosis. The present study investigates whether EGCG inhibits activation of the major gelatinase matrix metalloproteinase-2 (MMP-2) in rat hepatic stellate cells (HSC). Methods: The expression of MMP-2, tissue inhibitors of metalloproteinases-2 (TIMP-2), and membrane-type 1-MMP (MT1-MMP) was assessed by RT-PCR and Western blot analyses. MMP-2 activity was evaluated by zymography and MT1-MMP activity was assessed by an enzymatic assay. HSC migration was measured by a wound healing assay and cell invasion was assessed using Transwell cell culture chambers. Results: The expression of MMP-2 mRNA and protein in HSC was substantially reduced by EGCG treatment. EGCG treatment also reduced concanavalin A (ConA)-induced activation of secreted MMP-2 and reduced MT1-MMP activity in a dose-dependent manner. In addition, EGCG inhibited both HSC migration and invasion. Conclusion: The abilities of EGCG to suppress MMP-2 activation and HSC invasiveness suggest that EGCG may be useful in the treatment and prevention of hepatic fibrosis. © 2006 CPS and SIMM.
Organization-oriented analysis acts as the key step and foundation in building an organization-oriented methodology (OOM) for engineering multi-agent systems, especially open complex agent systems (OCAS). A number of existing approaches target OOM, but they are incompatible with each other, and none of them is available as a solid and practical tool for engineering OCAS. This paper summarizes our investigation into building a unified framework for abstracting and analyzing OCAS organizations. Our organization-oriented framework, referred to as ORGANISED, integrates and expands existing approaches and explicitly captures the main attributes of an OCAS. Following this framework, individual model-building blocks are developed for all ORGANISED members; both visual and formal specifications are utilized to present an intuitive and precise analysis. The above techniques have been deployed in developing an agent service-based trading and mining support infrastructure.
Open complex agent systems (OCAS) are middle-sized or large-scale open agent organizations. To engineer OCAS, agent-centric organization-oriented analysis, design and implementation, namely organization-oriented methodology (OOM), has emerged as a highly promising direction. A number of OOM-related approaches have been proposed, but some intrinsic issues are hidden in them. For instance, some fundamental system attributes, such as system dynamics, are covered by almost none of the existing approaches. In this paper, we summarize our investigation of existing approaches and report a new OOM approach called OSOAD. The OSOAD approach consists of organizational abstraction (OA), organization-oriented analysis (OOA), agent service-oriented design (ASOD), and Java agent service-based implementation. OSOAD provides complete and deployable mechanisms for all software engineering phases. In particular, we highlight the transition support from OA to OOA and ASOD. This approach has been built and deployed in the practical development of agent service-based financial trading and mining applications.
Luo, D, Liu, W, Luo, C, Cao, L & Dai, RW 2005, 'Hybrid Analyses and System Architecture for Telecom Frauds', Jisuanji Kexue (Computer Science), vol. 32, no. 5, pp. 17-22.
This paper tells a story of the synergy of two cutting-edge technologies: agents and data mining. By integrating these two technologies, the power of each is enhanced.
Cao, L & Dai, R 2003, 'Agent-Oriented Approach for Dealing with Open Giant Intelligent Systems', Moshi Shibie yu Rengong Zhineng - Journal of Pattern Recognition and Artificial Intelligence, vol. 15, no. 3, pp. 75-81.
Cao, L & Dai, R 2003, 'Human-Computer Cooperated Intelligent Information System Based on Multi-Agents', Zidonghua Xuebao - Acta Automatica Sinica, vol. 29, no. 1, pp. 86-94.
The Hall for Workshop of Metasynthetic Engineering (HWME) is an engineering technology proposed for coping with open complex giant systems. In this paper we describe the implementation of a human-computer-cooperated intelligent information system with HWME and multi-agents. We propose a layered model, a system structure over the network, and a distributed computing model, an n-tier client/agent/server-nested Requester-Mediator-Provider, for building the system. Furthermore, we discuss the framework and working mechanisms of an agent-based HWME system, which is designed for macroeconomic decision support based on intelligent information agents in Java. Our system implementation shows that an agent-oriented HWME system over the Internet may exhibit better performance in handling open complex problems.
Cao, L & Dai, R 2003, 'On Metasynthesis and Decision Making', Jisuanji Yanjiu yu Fazhan - Journal of Computer Research and Development, vol. 40, no. 1, pp. 531-537.
Cao, L & Dai, RW 2003, 'Agent-oriented Metasynthetic Engineering for Decision making', International Journal of Information Technology and Decision Making, vol. 2, no. 2, pp. 197-215.View/Download from: Publisher's site
Cao, LB & Dai, RW 2003, 'On metasynthesis and decision making', Jisuanji Yanjiu yu Fazhan/Computer Research and Development, vol. 40, no. 4, p. 531.
Dai, R & Cao, L 2003, 'Internet: An Open Complex Giant System', Science in China Series E: Technological Sciences, vol. 33, no. 4, pp. 289-296.
Cao, L & Dai, R 2002, 'Agent-oriented approach for dealing with open giant intelligent systems', Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, vol. 15, no. 3, p. 257.
Cao, L & Dai, R 2002, 'Software Architecture of the Hall for Workshop of Metasynthetic Engineering', Ruanjian Xuebao - Journal of Software, vol. 13, no. 8, pp. 1430-1435.
Cao, L, Nan, J & Dai, R 2002, 'Intelligent Mobile Agents for Distributed Information Integration', Xitong Fangzhen Xuebao - Journal of System Simulation, vol. 14, no. 11, pp. 1517-1520.
Cao, LB & Dai, RW 2001, 'Information system of metasynthetic wisdom-Internet', Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, vol. 14, no. 1, p. 1.
Cao, L, 'AI in FinTech: A Research Agenda'.
Smart FinTech has emerged as a new area that synthesizes and transforms AI and finance, and more broadly data science, machine learning, economics, etc. Smart FinTech also transforms and drives new economic and financial businesses, services and systems, and plays an increasingly important role in economic, technological and societal transformation. This article presents a highly summarized research overview of smart FinTech, including FinTech businesses and challenges, various FinTech-associated data and repositories, FinTech-driven business decision-making and optimization, areas in smart FinTech, and research methods and techniques for smart FinTech.
© Springer-Verlag London 2015. This chapter illustrates the issues in mining complex data and problems for knowledge that will support decision-making actions, and shows how complex problems are analyzed through consideration of the concepts and thinking in metasynthetic computing. Mining complex data to deliver actionable knowledge is becoming increasingly challenging. These challenges arise from the following issues: (1) the limitations of existing KDD methodologies and systems, such as purely data-driven techniques or the poor involvement of business and domain coupling with data; (2) the characteristics and challenges of complex data in the real world, which involve many different kinds of complexities, as discussed in Chaps. 1, 2, 3, and 4; (3) arguably, the methodologies and systems available in the current KDD literature rarely present a systematic and comprehensive guide from system sciences and multidisciplinary aspects, which instead play an important role in the study of open complex systems; and (4) the majority of existing work focuses on mining simple and manipulated data and problems, abstracted away from the complexities of real-life problems, so we face critical challenges in addressing real problems and their complexities. Real-life problems present as complex systems, and taking a systematic and comprehensive view is thus very important.
© Springer-Verlag London 2015. Architectural design is a crucial stage in software design. It aims to build an overall picture of what a problem-solving system for tackling a complex problem looks like by focusing on architectural aspects. These aspects involve basic computing unit definition and functionalities; design patterns; architectures for integrating related resources and applications; a system architecture for integrating and supporting computing units, resources, applications, and their interactions; integration strategies; and the communication, coordination, and management of computing units, resources, and applications.
© Springer-Verlag London 2015. The previous chapter introduced agent services-oriented architectural design. In this chapter, detailed design is addressed. We first describe the agent service ontology, which is key to the representation of agent services and aims for generic and consistent application. Endpoint interfaces for agent services are also described. Management work on directory, communication, transport, mediation, discovery, and other issues is detailed based on the discussion in Chap. 10.
© Springer-Verlag London 2015. Complex systems are ubiquitous and have become an increasing focus in scientific and business domains, since they are part of our daily life, business, and environment. Deeply understanding the intricacies of complex systems is thus a basic task in the scientific domain. In this chapter, we explore: the system complexities of complex systems, summarizing the main characteristics of complex systems; system transparency, outlining general categories of complex (as well as simple) systems in terms of the transparency of their content and complexity to users; system classification, creating multiple dimensions for categorizing complex systems; open complex systems, discussing their characteristics and challenges; large-scale systems, discussing those complex systems that have a huge number of components; hybrid intelligent systems, showing systems that hybridize different techniques, methods, and tools; and computing and engineering complex systems, summarizing the main computing paradigms, system analysis and design, and the objectives of metasynthetic computing and engineering.
© Springer-Verlag London 2015. To analyze, design, and implement problem-solving solutions for complex systems, we need effective computing paradigms. The methodologies for engineering complex systems have been evolving toward (1) addressing increasingly complex problems and building corresponding systems, (2) providing more user-friendly interfaces, and (3) supporting enterprise application integration. Usually, a computing paradigm (methodology) is proposed on top of a core metaphor and concept. With the proposal of core concepts such as objects, components, services, agents, agent services, organization, and cloud, corresponding software engineering methodologies have also been proposed or are under study at the same time. Objects, components, services, and agents are very popular high-level abstractions. They have been, or are currently, used by academic software designers and industrial architects to construct software systems that model the real world. Subsequently, methodologies including object-oriented methodology, component-based methodology, service-oriented methodology, and agent-oriented methodology have been proposed to analyze, design, and implement the complexities in complex systems (usually engineering systems). In addition, increased attention has been paid to other general concepts in the social science, economics, and cultural domains; for instance, behavior, organization, autonomy, sociality, cloud, and service. These have formed new computing paradigms, including autonomic computing, behavior computing, social computing, and cloud computing. More recently, to address the system complexities in open complex systems, metasynthetic computing has been proposed. In this chapter, the concepts, basic principles, and strengths and weaknesses of the above core concepts and computing paradigms are discussed. The aim is to provide basic concepts to understand what computing and engineering methodologies are available and which o...
© Springer-Verlag London 2015. While visual modeling provides a visible way to present and represent system constituents, relationships, and dynamics, formal modeling complements it by formalizing the above aspects embedded in a system. Usually, this is done through developing temporal logics-based representation and modeling tools. In this chapter, formal modeling blocks are discussed to model organizational elements discussed in previous chapters, including actor, rule, relationship, interaction, goal and properties.
© Springer-Verlag London 2015. Visual modeling and formal modeling provide respective and often complementary means for capturing system complexities and functions. In this chapter, integrative modeling is introduced, which aims to combine functional requirements with nonfunctional ones, and visual modeling with formal modeling.
© Springer-Verlag London 2015. Behavioral and social applications are ubiquitous, ranging from business and online applications to social and organizational applications and domains. With the increasing and continuous development of such applications, an emerging need is to develop an in-depth understanding of the underlying working mechanisms, driving forces, dynamics and evolution of a behavioral and/or social system, as well as its impact on business and context. To this end, building on the classic theories and tools available in behavioral science and social science, behavior informatics (Cao L, Inform Sci 180:3067–3085, 2010; Cao L, Yu PS (eds), Behavior computing: modeling, analysis, mining and decision. Springer, Berlin, 2012) and social informatics (Liu H, Salerno J, Young MJ (eds), Social computing, behavioral modeling, and prediction. Springer, Berlin, 2008) have recently been studied to "formalize," "quantify," and "compute" complex behavioral and social applications.
© Springer-Verlag London 2015. The problems and systems we tackle in our daily business are becoming increasingly complex, and as a consequence, existing methodologies and tools are similarly confronted by heightened challenges.
© Springer-Verlag London 2015. From the discussion of integrative modeling and organization-oriented analysis, we have seen that conceptualization and declarative knowledge have been deeply involved in these processes. In practice, sharing, transferring, and transforming a conceptualization system from system analysis to design and implementation are critical issues in building a deployable system of software engineering. Precision, usability, continuability, deployability, and scalability are key objectives in developing the sharable conceptions and declarative knowledge in the whole life cycle of OADI. This is what ontological engineering can do.
© Springer-Verlag London 2015. The world is becoming increasingly connected, both loosely and tightly, in terms of explicit or implicit coupling relationships. However large-scale a system is, scale may matter less in the evolution of new data and information processing technology than other system complexities, such as invisible heterogeneity, coupling relationships, human involvement, and ubiquitous intelligence in particular.
© Springer-Verlag London 2015. In Chap. 9, an integrative modeling framework was proposed for modeling open agent systems. In Chap. 6, we advocated an ORGANISED framework to undertake the organizational abstraction of an agent organization in terms of organizational metaphor. Collections of visual model-building blocks were subsequently specified and defined in Chap. 7 for the concrete, graphic analysis of an open agent organization. Chap. 8 presented a system of formal representation of the ORGANISED framework through temporal logics. This body of work constitutes the organization-oriented analysis system of open agent systems in terms of organizational metaphor.
© Springer-Verlag London 2015. Computing and engineering complex systems requires effective methodologies that can cater for the specific system complexities within those systems. In particular, such a methodology should address the characteristics of complex systems including openness to the environment, interactions, relationships and rules, and sociality, and should support both the analysis and design of complex problem-solving systems.
© Springer-Verlag London 2015. In human history, many different methodologies, metaphors, and philosophies (Auyang SY, Foundations of complex-system theories: in economics, evolutionary biology, and statistical physics. Cambridge University Press, 1999; Cao L, Dai R, Open complex intelligent systems (in Chinese). Post & Telecom, 2008; Dai R, Li Y, Li Q, Social intelligence and metasynthetic system (in Chinese). Post & Telecom, 2013; Dai R, Pattern Recogn Artif Intell 6(2):60–65, 1993; Qian X, Yu J, Dai R, Chin J Nat 13(1): 3–10, 1990; Qian X, Pattern Recogn Artif Intell 4(1):5–8, 1991; Qian X, Building systematology (in Chinese). Shanghai Jiaotong University Press, 2007; Cao L, Zhang C, Zhou M, IEEE Trans Syst Man Cy C: Appl Rev, 2008; Cao L, Dai R, Zhou M, IEEE Trans Syst Man Cy A 39(5):1007–1021, 2009) have been proposed by scientists, philosophers, theologians, and thinkers. Scientific development has greatly benefited from them, as is shown in the history of the physics family. The appropriate understanding of complex systems, system complexities, and the ubiquitous intelligence surrounding and embedded in complex systems, as well as engineering such systems and complexities, require full acknowledgement of the functions, characteristics, and suitability of respective methodologies.
© Springer-Verlag London 2015. Complex systems involve multiple aspects such as domain knowledge, constraints, human roles and interaction, life cycle and process management, and organizational and social factors. Many complex systems are dynamic and need to cater for online, run time, and ad hoc requests. With the involvement of social intelligence and its complexities, such complex systems need to consider reliability, reputation, risk, privacy, security, trust, and actionability of problem-solving solutions. Research in one area can actually stimulate, complement, and enhance research in another. A typical example is agent mining technology (Cao L, Gorodetsky V, Mitkas P, IEEE Intell Syst, May/June, 2009; Cao L, Gorodetsky V, Mitkas P, Guest editors' introduction: Agents and data mining, May/June, 2009; Cao L (ed), Data mining and multiagent integration. Springer, 2009; Gorodetsky V, et al (eds), Autonomous intelligent systems: agents and data mining. LNAI, vol 4476. Springer, 2007), which synergizes the ubiquitous intelligence for handling complex intelligent problems and systems through the combined strengths of data mining, machine learning, and multiagent systems. Other typical examples that involve ubiquitous intelligence include open complex intelligent systems (Cao L, Dai R, Open complex intelligent systems. Post & Telecom Press, 2008; Qian X, Yu J, Dai R, Chin J Nat 13(1):3–10, 1990), domain-driven actionable knowledge discovery (Cao L et al, IEEE Intell Syst 22(4):78–89, 2007), combined mining for discovering complex patterns (Cao L, Zhang H, Zhao Y, Zhang C, General frameworks for combined mining: case studies in e-government services. Submitted to ACM TKDD, 2008), and ubiquitous computing (Poslad S, Ubiquitous computing: smart devices, environments and interactions. Wiley, 2009).
© Springer-Verlag London 2015. Visual modeling means to model a system and its components, relationships and interactions, structures and patterns, etc. by setting up visual building blocks. Visual modeling adopts reductionism methodology and focuses on modeling each part, and the interaction between parts, to achieve an overall picture of a complex system.
Li, M, Li, J, Ou, Y & Cao, L 2015, 'A Coupled Similarity Kernel for Pairwise Support Vector Machine' in Lecture Notes in Computer Science, Springer International Publishing, pp. 114-123.
Li, J, Cao, L, Wang, C, Tan, KC & Liu, B 2013, 'Preface' in Li, J, Cao, L, Wang, C, Tan, KC, Liu, B, Pei, J & Tseng, VS (eds), Trends and Applications in Knowledge Discovery and Data Mining: PAKDD 2013 International Workshops: DMApps, DANTH, QIMIE, BDM, CDA, CloudSD, Gold Coast, QLD, Australia, April 14-17, 2013, Revised Selected Papers, pp. V-V.
Motoda, H, Wu, Z, Cao, L, Zaiane, O, Yao, M & Wang, W 2013, 'Preface' in Motoda, H, Wu, Z, Cao, L, Zaiane, O, Yao, M & Wang, W (eds), Advanced Data Mining and Applications: 9th International Conference, ADMA 2013, Hangzhou, China, December 14-16, 2013, Proceedings, Part I, pp. VI-VI.
Motoda, H, Wu, Z, Cao, L, Zaiane, O, Yao, M & Wang, W 2013, 'Preface' in Motoda, H, Wu, Z, Cao, L, Zaiane, O, Yao, M & Wang, W (eds), Advanced Data Mining and Applications: 9th International Conference, ADMA 2013, Hangzhou, China, December 14-16, 2013, Proceedings, Part II, Springer, pp. vi-vi.
Yu, PS, Singh, MP, Cao, L, Zeng, Y, Symeonidis, AL & Gorodetsky, V 2013, 'Preface' in Cao, L, Zeng, Y, Symeonidis, AL, Gorodetsky, VI, Yu, PS & Singh, MP (eds), Agents and Data Mining Interaction: 8th International Workshop, ADMI 2012, Valencia, Spain, June 4-5, 2012, Revised Selected Papers, pp. V-VI.
Cao, L, Motoda, H, Srivastava, J, Lim, EP, King, I, Yu, PS, Nejdl, W, Xu, G, Li, G & Zhang, Y 2013, 'Preface' in Behavior and Social Computing, Springer, Germany, pp. v-vi.
Cao, L, Srivastava, J, Williams, G & Motoda, H 2012, 'International Workshop on Behavior Informatics (BI 2011): PC chairs' message' in New Frontiers in Applied Data Mining: PAKDD 2011 International Workshops, Shenzhen, China, May 24-27, 2011, Revised Selected Papers, pp. VII-VIII.
Wang, C & Cao, L 2012, 'Modeling and Analysis of Social Activity Process' in Cao, L & Yu, PS (eds), Behavior Computing: Modeling, Analysis, Mining and Decision, Springer-Verlag London Ltd., London, UK, pp. 21-35.
Behavior modeling has been increasingly recognized as a crucial means for disclosing the interior driving forces and impacts in social activity processes. Traditional behavior modeling in the behavioral and social sciences mainly relies on qualitative methods and is not aimed at deep, quantitative analysis of social activities. However, with the booming need to understand customer behaviors, social networks, etc., there is a shortage of formal, systematic and unified behavior modeling and analysis methodologies and techniques. This paper proposes a novel and unified general framework, called the Social Activity Process Modeling and Analysis System (SAPMAS). Our approach is to model social behaviors and analyze social activity processes by using model checking. More specifically, we construct behavior models from sub-models of actor, action, environment and relationship, translate concrete properties into formal temporal logic formulae, and finally obtain analysis results with the model checker SPIN. An online shopping process is used to illustrate the whole framework.
Weiss, G, Yu, PS, Cao, L, Bazzan, A, Symeonidis, AL & Gorodetsky, V 2012, 'Message from the workshop chairs' in Agents and Data Mining Interaction: 7th International Workshop, ADMI 2011, Taipei, Taiwan, May 2-6, 2011, Revised Selected Papers, pp. V-VI.
Tsai, P, Tran, TP & Cao, L 2010, 'A New Multimodal Biometric for Personal Identification' in Herout, A (ed), Pattern Recognition Recent Advances, InTech, pp. 341-366.
Weiss, G, Yu, PS, Cao, L, Bazzan, A, Mitkas, PA & Gorodetsky, V 2010, 'Message from the Workshop Chairs' in Cao, L, Bazzan, ALC, Gorodetsky, V, Mitkas, PA, Weiss, G & Yu, PS (eds), Agents and Data Mining Interaction: 6th International Workshop on Agents and Data Mining Interaction, ADMI 2010, Toronto, ON, Canada, May 11, 2010, Revised Selected Papers, pp. v-vi.
Zhang, H, Zhao, Y, Cao, L, Zhang, C & Bohlscheid, H 2010, 'Rare class association rule mining with multiple imbalanced attributes' in Koh, YS & Rountree, N (eds), Rare Association Rule Mining and Knowledge Discovery: Technologies for Infrequent and Critical Event, IGI Global, Hershey, Pennsylvania, pp. 66-75.
In this chapter, the authors propose a novel framework for rare class association rule mining. In each class association rule, the right-hand side is a target class while the left-hand side may contain one or more attributes. The algorithm focuses on multiple imbalanced attributes on the left-hand side. In the proposed framework, rules with and without imbalanced attributes are processed in parallel: rules without imbalanced attributes are mined with a standard algorithm, while rules with imbalanced attributes are mined using newly defined measurements. Through a simple transformation, these measurements can be mapped into a uniform space so that only a few parameters need to be specified by the user. In the case study, the proposed algorithm is applied in the social security field. Although some attributes are severely imbalanced, rules with a minority of imbalanced attributes have been mined efficiently.
Cao, L 2009, 'Actionable Knowledge Discovery' in Khosrow-Pour, M (ed), Encyclopedia of Information Science and Technology, IGI Global, Hershey, PA, USA, pp. 8-13.
Actionable knowledge discovery has been identified as one of the greatest challenges (Ankerst, 2002; Fayyad, Shapiro, & Uthurusamy, 2003) of next-generation knowledge discovery in databases (KDD) studies (Han & Kamber, 2006). In existing data mining, the mined patterns are often not actionable with respect to real user needs. To enhance knowledge actionability, domain-related social intelligence is essential (Cao et al., 2006b). Involving domain-related social intelligence in data mining leads to domain-driven data mining (Cao & Zhang, 2006a, 2007a), which complements the traditional data-centered mining methodology. Domain-related social intelligence consists of human, domain, environment, society and cyberspace intelligence, and complements data intelligence. The extension of KDD toward domain-driven data mining involves many challenging but promising research and development issues. Studies of these issues may promote a paradigm shift of KDD from data-centered interesting pattern mining to domain-driven actionable knowledge discovery, and a deployment shift from simulated data sets to real-life data and business environments, as widely predicted.
Cao, L 2009, 'Developing Actionable Trading Strategies' in Jain, LC & Nguyen, NT (eds), Knowledge Processing and Decision Making in Agent-Based Systems, Springer, Berlin, Germany, pp. 193-215.
Actionable trading strategies for trading agents determine the potential of simulated models in real-life markets. Developing actionable strategies is a non-trivial task that must consider real-life constraints and organizational factors in the market. In this paper, we first analyze such constraints on developing actionable trading strategies. We then propose an actionable trading strategy development framework, and deploy it in developing a series of actionable trading strategies through optimizing, enhancing, discovering and integrating trading strategies. We demonstrate working case studies on market data, and evaluate these approaches and their performance from both technical and business perspectives. Actionable trading strategies have the potential to support smart trading decisions for brokerage firms and financial companies.
In recent years, more and more researchers have been involved in research on both agent technology and data mining. A clear disciplinary effort has been activated toward removing the boundary between them, that is, the interaction and integration between agent technology and data mining. We refer to this new area as agent mining. The marriage of agents and data mining is driven by challenges faced by both communities, and by the need to develop more advanced intelligence, information processing and systems. This chapter presents an overall picture of agent mining from the perspective of positioning it as an emerging area. We summarize the main driving forces, complementary essence, disciplinary framework, applications, case studies, and trends and directions, as well as brief observations on agent-driven data mining, data mining-driven agents, and mutual issues in agent mining. Arguably, we draw the following conclusions: (1) agent mining emerges as a new area in the scientific family, (2) both agent technology and data mining can greatly benefit from agent mining, and (3) it is very promising to result in additional advancement in intelligent information processing and systems. However, as a new open area, many issues await research and development from theoretical, technological and practical perspectives.
Cao, L, Yu, P, Zhang, C & Zhang, H 2009, 'Introduction to Domain Driven Data Mining' in Cao, L, Yu, PS, Zhang, C & Zhang, H (eds), Data Mining for Business Applications, Springer, New York, USA, pp. 3-10.
Data Mining for Business Applications presents state-of-the-art data mining research and development related to methodologies, techniques, approaches and successful applications. The contributions of this book mark a paradigm shift from "data-centered pattern mining" to "domain-driven actionable knowledge discovery (AKD)" for next-generation KDD research and applications. The contents identify how KDD techniques can better contribute to critical domain problems in practice, and strengthen business intelligence in complex enterprise applications. The volume also explores challenges and directions for future data mining research and development in the dialogue between academia and business.
McNicholas, PD & Zhao, Y 2009, 'Association Rules: An Overview' in Zhao, Y, Zhang, C & Cao, L (eds), Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, IGI Global, USA, pp. 1-10.
Association rules present one of the most versatile techniques for the analysis of binary data, with applications in areas as diverse as retail, bioinformatics, and sociology. In this chapter, the origin of association rules is discussed along with the functions by which association rules are traditionally characterised. Following the formal definition of an association rule, these functions (support, confidence and lift) are defined and various methods of rule generation are presented, spanning 15 years of development. There is some discussion of negations and negative association rules, and an analogy between association rules and 2×2 tables is outlined. Pruning methods are discussed, followed by an overview of measures of interestingness. Finally, the post-mining stage of the association rule paradigm is put in the context of the preceding stages of the mining process.
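The three measures named in this abstract are easy to make concrete. Below is a minimal Python sketch on an invented basket dataset (the item names, transactions and helper functions are illustrative only, not from the chapter):

```python
# Toy transaction database (hypothetical retail baskets).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """P(rhs | lhs): support of the union over support of the antecedent."""
    return support(lhs | rhs, db) / support(lhs, db)

def lift(lhs, rhs, db):
    """Confidence normalized by the consequent's support; 1 means independence."""
    return confidence(lhs, rhs, db) / support(rhs, db)

print(support({"bread", "milk"}, transactions))     # 0.5
print(confidence({"bread"}, {"milk"}, transactions))
print(lift({"bread"}, {"milk"}, transactions))
```

On these toy baskets, the rule bread ⇒ milk has support 0.5 and confidence 2/3; its lift of 8/9 is below 1, indicating the two items co-occur slightly less often than independence would predict.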
Wu, S, Zhao, Y, Zhang, H, Zhang, C, Cao, L & Bohlscheid, H 2009, 'Debt Detection in Social Security by Adaptive Sequence Classification' in Karagiannis, D & Jin, Z (eds), Knowledge Science, Engineering and Management, Springer, Germany, pp. 192-203.
Debt detection is important for improving payment accuracy in social security. Since debt detection from customer transaction data can be generally modelled as a fraud detection problem, a straightforward solution is to extract features from transaction sequences and build a sequence classifier for debts. For long-running debt detections, the patterns in the transaction sequences may exhibit variation from time to time, which makes it imperative to adapt classification to the pattern variation. In this paper, we present a novel adaptive sequence classification framework for debt detection in a social security application. The central technique is to catch up with the pattern variation by boosting discriminative patterns and depressing less discriminative ones according to the latest sequence data.
Zhao, Y, Cao, L, Zhang, H & Zhang, C 2009, 'Data Clustering' in Ferraggine, VE, Doorn, JH & Rivero, LC (eds), Handbook of Research on Innovations in Database Technologies and Applications: Current and Future Tr, IGI Global, USA, pp. 562-572.
Clustering is one of the most important techniques in data mining. This chapter presents a survey of popular approaches for data clustering, including well-known clustering techniques, such as partitioning clustering, hierarchical clustering, density-based clustering and grid-based clustering, and recent advances in clustering, such as subspace clustering, text clustering and data stream clustering. The major challenges and future trends of data clustering will also be introduced in this chapter. The remainder of this chapter is organized as follows. The background of data clustering will be introduced in Section 2, including the definition of clustering, categories of clustering techniques, features of good clustering algorithms, and the validation of clustering. Section 3 will present main approaches for clustering, which range from the classic partitioning and hierarchical clustering to recent approaches of bi-clustering and semisupervised clustering. Challenges and future trends will be discussed in Section 4, followed by the conclusions in the last section.
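As a minimal illustration of the partitioning family surveyed in this chapter, the following sketch implements plain k-means (the data points and the deterministic seeding are invented for the example):

```python
def kmeans(points, centroids, iters=20):
    """Minimal k-means: alternate nearest-centroid assignment and mean update."""
    centroids = list(centroids)
    k = len(centroids)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return centroids, clusters

# Two well-separated groups of 2-D points (invented data).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
# Seed with one point from each visible group: a deterministic choice for the sketch.
centroids, clusters = kmeans(points, centroids=[points[0], points[-1]])
```

On this data the algorithm converges after a single iteration, with centroids at the means of the two groups; real partitioning methods add smarter seeding (e.g. k-means++) and convergence checks.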
Zhao, Y, Zhang, H, Cao, L, Bohlscheid, H, Ou, Y & Zhang, C 2009, 'Data Mining Applications in Social Security' in Cao, L, Yu, PS, Zhang, C & Zhang, H (eds), Data Mining for Business Applications, Springer, New York, USA, pp. 81-96.
This chapter presents four applications of data mining in social security. The first is an application of decision tree and association rules to find the demographic patterns of customers. Sequence mining is used in the second application to find activity sequence patterns related to debt occurrence. In the third application, combined association rules are mined from heterogeneous data sources to discover patterns of slow payers and quick payers. In the last application, clustering and analysis of variance are employed to check the effectiveness of a new policy.
Traditional data mining, based on quantitative intelligence, faces grand challenges from real-world enterprise and cross-organization applications. For instance, the usual demonstration of specific algorithms cannot support business users in taking actions that serve their needs and advantage. We attribute this to a quantitative-intelligence-focused, data-driven philosophy, which either views data mining as an autonomous data-driven, trial-and-error process, or analyzes business issues only in an isolated, case-by-case manner. Based on experience and lessons learnt from real-world data mining and complex systems, this article proposes a practical data mining methodology referred to as Domain-Driven Data Mining. On top of quantitative intelligence and the hidden knowledge in data, domain-driven data mining aims to meta-synthesize quantitative intelligence and qualitative intelligence in mining complex applications in which humans are in the loop. It targets actionable knowledge discovery in a constrained environment to satisfy user preferences. The domain-driven methodology consists of key components including understanding the constrained environment, business-technical questionnaires, representing and involving domain knowledge, human-mining cooperation and interaction, constructing next-generation mining infrastructure, in-depth pattern mining and post-processing, business interestingness and actionability enhancement, and loop-closed, human-cooperated iterative refinement. Domain-driven data mining complements the data-driven methodology; the meta-synthesis of qualitative and quantitative intelligence has the potential to discover knowledge from complex systems and to enhance knowledge actionability for practical use by industry and business.
Cao, L, Zhang, C, Luo, D & Dai, R 2007, 'Intelligence Metasynthesis in Building Business Intelligence Systems' in Carbonell, JG, Siekmann, J, Zhong, N, Liu, J, Yao, Y, Wu, J, Lu, S & Li, K (eds), Lecture Notes in Artificial Intelligence - Lecture Notes in Computer Science (Book Series), Springer, Germany, pp. 454-470.
In our previous work, we analyzed the shortcomings of existing business intelligence (BI) theory and its actionable capability. One of the approaches we presented is the ontology-based integration of business, data warehousing and data mining, which may make existing BI systems as user- and business-friendly as expected. However, it is challenging to tackle these issues and construct actionable, business-friendly systems simply by improving the existing BI framework. Therefore, in this paper, we propose a new framework for constructing next-generation BI systems: intelligence metasynthesis. Next-generation BI systems should, to some extent, synthesize four types of intelligence: data intelligence, domain intelligence, human intelligence and network/web intelligence. The theory guiding this intelligence metasynthesis is metasynthetic engineering, for which an appropriate intelligence integration framework is substantially important. We first address the roles of each type of intelligence in developing next-generation BI systems. Implementation issues are then addressed by discussing key components for synthesizing the intelligence. The proposed framework is based on our real-world experience and practice in designing and implementing BI systems, and it also greatly benefits from multi-disciplinary dialogue with fields such as complex intelligent systems and cognitive science. The proposed theoretical framework has the potential to deal with key challenges in existing BI frameworks and systems.
Cao, L, Hu, L, Jian, S, Gu, Z, Chen, Q & Amirbekyan, A 2019, 'HERS: Modeling Influential Contexts with Heterogeneous Relations for Sparse and Cold-Start Recommendation', Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Honolulu, Hawaii USA, pp. 3830-3837.
Classic recommender systems face challenges in addressing the data sparsity and cold-start problems when modeling only the user-item relation. An essential direction is to incorporate and understand additional heterogeneous relations, e.g., user-user and item-item relations, since each user-item interaction is often influenced by other users and items, which form the user's/item's influential contexts. This induces important yet challenging issues, including modeling heterogeneous relations and interactions, and the strength of the influence from users/items in the influential contexts. To this end, we design Influential-Context Aggregation Units (ICAUs) to aggregate the user-user/item-item relations within a given context into influential context embeddings. Accordingly, we propose a Heterogeneous relations-Embedded Recommender System (HERS) based on ICAUs to model and interpret the underlying motivation of user-item interactions by considering user-user and item-item influences. Experiments on two real-world datasets show the greatly improved recommendation quality achieved by HERS and its superiority in handling the cold-start problem. In addition, we demonstrate the interpretability of modeling influential contexts in explaining the recommendation results.
Cao, L, Xu, P, Deng, Z, Choi, K-S & Wang, S 2019, 'Multi-View Information-Theoretic Co-Clustering for Co-Occurrence Data', Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Honolulu, Hawaii, USA, pp. 379-386.
Multi-view clustering has received much attention recently, but most existing multi-view clustering methods focus only on one-sided clustering. As co-occurring data elements involve counts of sample-feature co-occurrences, it is more efficient to conduct two-sided clustering along the samples and features simultaneously. To take advantage of two-sided clustering for co-occurrences in the multi-view setting, a two-sided multi-view clustering method is proposed: multi-view information-theoretic co-clustering (MV-ITCC). The proposed method realizes two-sided clustering for co-occurring multi-view data under an information-theoretic formulation. More specifically, it exploits the agreement and disagreement among views by sharing a common clustering result along the sample dimension and keeping the clustering results of each view specific along the feature dimension. In addition, a maximum-entropy mechanism is adopted to control the importance of different views, which strikes the right balance in leveraging agreement and disagreement. Extensive experiments on text and image multi-view datasets clearly demonstrate the superiority of the proposed method.
Jian, S, Hu, L, Cao, L, Lu, K & Gao, H 2019, 'Evolutionarily Learning Multi-Aspect Interactions and Influences from Network Structure and Node Content', Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Honolulu, Hawaii USA, pp. 598-605.
The formation of a complex network is highly driven by multi-aspect node influences and interactions, reflected in network structures and the content embodied in network nodes. Limited work has jointly modeled all these aspects; it typically focuses on topological structures but overlooks the heterogeneous interactions behind node linkage and the contributions of node content to the interactive heterogeneities. Here, we propose a multi-aspect interaction and influence-unified evolutionary coupled system (MAI-ECS) for network representation by involving node content and linkage-based network structure. MAI-ECS jointly and iteratively learns two systems: a multi-aspect interaction learning system to capture heterogeneous hidden interactions between nodes and an influence propagation system to capture multi-aspect node influences and their propagation between nodes. MAI-ECS couples, unifies and optimizes the two systems toward an effective representation of explicit node content and network structure, and implicit node interactions and influences. MAI-ECS shows superior performance in node classification and link prediction in comparison with the state-of-the-art methods on two real-world datasets. Further, we demonstrate the semantic interpretability of the results generated by MAI-ECS.
Shi, L, Li, S, Cao, L, Yang, L & Pan, G 2019, 'TBQ(σ): Improving efficiency of trace utilization for off-policy reinforcement learning', Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS, International Joint Conference on Autonomous Agents and Multiagent Systems, IFAAMAS, Montreal, Canada, pp. 1025-1032.
Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between the target policy and the behavior policy. One common approach is to measure the difference between the two policies probabilistically, as in importance sampling and tree-backup. However, existing off-policy learning methods based on probabilistic policy measurement are inefficient when utilizing traces under a greedy target policy, which is ineffective for control problems: the traces are cut immediately when a non-greedy action is taken, which may lose the advantage of eligibility traces and slow down the learning process. Alternatively, non-probabilistic measurement methods such as General Q(λ) and Naive Q(λ) never cut traces, but face convergence problems in practice. To address these issues, this paper introduces a new method named TBQ(σ), which effectively unifies the tree-backup algorithm and Naive Q(λ) by introducing a new parameter σ to control the degree of trace utilization.
Wang, S, Hu, L, Wang, Y, Cao, L, Sheng, QZ & Orgun, M 2019, 'Sequential recommender systems: Challenges, progress and prospects', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, IJCAI, Macao, pp. 6332-6338.
The emerging topic of sequential recommender systems (SRSs) has attracted increasing attention in recent years. Different from the conventional recommender systems (RSs) including collaborative filtering and content-based filtering, SRSs try to understand and model the sequential user behaviors, the interactions between users and items, and the evolution of users' preferences and item popularity over time. SRSs involve the above aspects for more precise characterization of user contexts, intent and goals, and item consumption trend, leading to more accurate, customized and dynamic recommendations. In this paper, we provide a systematic review on SRSs. We first present the characteristics of SRSs, and then summarize and categorize the key challenges in this research area, followed by the corresponding research progress consisting of the most recent and representative developments on this topic. Finally, we discuss the important research directions in this vibrant area.
Wang, S, Hu, L, Wang, Y, Sheng, QZ, Orgun, M & Cao, L 2019, 'Modeling multi-purpose sessions for next-item recommendations via mixture-channel purpose routing networks', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, Macao, pp. 3771-3777.
A session-based recommender system (SBRS) suggests the next item by modeling the dependencies between items in a session. Most existing SBRSs assume the items inside a session are associated with one (implicit) purpose. However, this may not always be true in reality, and a session may often consist of multiple subsets of items for different purposes (e.g., breakfast and decoration). Specifically, items (e.g., bread and milk) in a subset have strong purpose-specific dependencies whereas items (e.g., bread and vase) from different subsets have much weaker or even no dependencies due to the difference of purposes. Therefore, we propose a mixture-channel model to accommodate the multi-purpose item subsets for more precisely representing a session. To address the shortcomings in existing SBRSs, this model recommends more diverse items to satisfy different purposes. Accordingly, we design effective mixture-channel purpose routing networks (MCPRNs) with a purpose routing network to detect the purposes of each item and assign them into the corresponding channels. Moreover, a purpose-specific recurrent network is devised to model the dependencies between items within each channel for a specific purpose. The experimental results show the superiority of MCPRN over the state-of-the-art methods in terms of both recommendation accuracy and diversity.
Cao, L, Chen, X, Liu, F, Tu, E & Yang, J 2018, 'Deep-PUMR: Deep Positive and Unlabeled Learning with Manifold Regularization', Lecture Notes in Computer Science, International Conference on Neural Information Processing, Springer Verlag, Siem Reap, Cambodia, pp. 12-20.
Training a binary classifier only on positive and unlabeled examples (i.e., the PU learning) is an important yet challenging issue, widely seen in many problems in which it is difficult to obtain negative examples. Existing methods for handling this challenge often perform unsatisfactorily, since they often ignore the relations between positive and unlabeled examples and are also limited to the traditional shallow learning frameworks. Therefore, this work proposes a new approach: Deep Positive and Unlabeled learning with Manifold Regularization (Deep-PUMR), which integrates the manifold regularization with deep neural networks to address the above issues with classic PU learning. Deep-PUMR holds two major advantages: (i) Our method exploits the manifold properties of data distribution to capture the relationship of positive and unlabeled examples; (ii) The adopted deep network enables Deep-PUMR with strong learning ability, especially on large-scale datasets. Extensive experiments on five diverse datasets demonstrate that Deep-PUMR achieves the state-of-the-art performance in comparison with classic PU learning algorithms and risk estimators.
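As a much simplified, hedged sketch of the idea behind PU learning with manifold regularization (not the authors' Deep-PUMR model), the toy example below trains a linear scorer with the naive PU baseline, treating unlabeled points as tentative negatives, plus a graph-Laplacian smoothness penalty; all data and hyperparameters are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs. Only some positives carry a label;
# everything else is "unlabeled" (hidden positives mixed with negatives).
pos = rng.normal([2.0, 2.0], 0.5, size=(40, 2))
neg = rng.normal([-2.0, -2.0], 0.5, size=(40, 2))
X = np.vstack([pos, neg])
y_true = np.array([1] * 40 + [0] * 40)
labeled = np.zeros(80, dtype=bool)
labeled[:15] = True  # 15 labeled positives, 65 unlabeled points

# k-NN graph Laplacian L = D - W encodes the manifold assumption:
# nearby points should receive similar scores.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.zeros_like(d2)
for i in range(len(X)):
    for j in np.argsort(d2[i])[1:6]:  # 5 nearest neighbours (excluding self)
        W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

# Linear scorer f(x) = x.w, trained on the naive PU risk (unlabeled treated
# as tentative negatives) plus a smoothness penalty (lam/2) f(X)' L f(X),
# by plain gradient descent.
w = np.zeros(2)
lam, lr = 0.05, 0.01
y_pu = labeled.astype(float)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))      # sigmoid scores
    grad = X.T @ (p - y_pu) / len(X)      # logistic-loss gradient
    grad += lam * (X.T @ (L @ (X @ w)))   # manifold-smoothness gradient
    w -= lr * grad

pred = (X @ w > 0).astype(int)
```

Despite only 15 of the 40 positives being labeled, the smoothed scorer recovers the hidden positives on this separable toy data; Deep-PUMR replaces the linear scorer with a deep network and uses a proper PU formulation rather than this naive baseline.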
Do, TDT & Cao, L 2018, 'Coupled Poisson factorization integrated with user/item metadata for modeling popular and sparse ratings in scalable recommendation', 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, AAAI Conference on Artificial Intelligence, AAAI, New Orleans, USA, pp. 2918-2925.
Modelling sparse and large data sets is highly in demand yet challenging in recommender systems. With the computation only on the non-zero ratings, Poisson Factorization (PF) enabled by variational inference has shown its high efficiency in scalable recommendation, e.g., modeling millions of ratings. However, as PF learns the ratings by individual users on items with the Gamma distribution, it cannot capture the coupling relations between users (items) and the rating popularity (i.e., favorable rating scores that are given to one item) and rating sparsity (i.e., those users (items) with many zero ratings) for one item (user). This work proposes Coupled Poisson Factorization (CPF) to learn the couplings between users (items), and the user/item attributes (i.e., metadata) are integrated into CPF to form the Metadata-integrated CPF (mCPF) to not only handle sparse but also popular ratings in very large-scale data. Our empirical results show that the proposed models significantly outperform PF and address the key limitations in PF for scalable recommendation.
Do, TDT & Cao, L 2018, 'Gamma-Poisson Dynamic Matrix Factorization Embedded with Metadata Influence', Advances in Neural Information Processing Systems, Advances in Neural Information Processing Systems, Neural Information Processing Systems Foundation, Montreal, Canada, pp. 1-12.
A conjugate Gamma-Poisson model for Dynamic Matrix Factorization incorporated with metadata influence (mGDMF for short) is proposed to effectively and efficiently model massive, sparse and dynamic data in recommendations. Modeling recommendation problems with a massive number of ratings and very sparse or even no ratings on some users/items in a dynamic setting is very demanding and poses critical challenges to well-studied matrix factorization models due to the large-scale, sparse and dynamic nature of the data. Our proposed mGDMF tackles these challenges by introducing three strategies: (1) constructing a stable Gamma-Markov chain model that smoothly drifts over time by combining both static and dynamic latent features of data; (2) incorporating the user/item metadata into the model to tackle sparse ratings; and (3) undertaking stochastic variational inference to efficiently handle massive data. mGDMF is conjugate, dynamic and scalable. Experiments show that mGDMF significantly (both effectively and efficiently) outperforms the state-of-the-art static and dynamic models on large, sparse and dynamic data.
Do, TDT & Cao, L 2018, 'Metadata-dependent infinite poisson factorization for efficiently modelling sparse and large matrices in recommendation', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, IJCAI, Stockholm, Sweden, pp. 5010-5016.
Matrix Factorization (MF) is widely used in Recommender Systems (RSs) to estimate missing ratings in the rating matrix. MF faces the major challenges of handling very sparse and large data. Poisson Factorization (PF), an MF variant, addresses these challenges with high efficiency by computing only on the non-missing elements. However, ignoring the missing elements in computation makes PF weak or incapable of dealing with columns or rows with very few observations (corresponding to sparse items or users). In this work, Metadata-dependent Poisson Factorization (MPF) is proposed to address user/item sparsity by integrating user/item metadata into PF. MPF adds metadata-based observed entries to the factorized PF matrices. In addition, as in MF, choosing a suitable number of latent components for PF is very expensive on very large datasets. Accordingly, we further extend MPF to Metadata-dependent Infinite Poisson Factorization (MIPF), which integrates a Bayesian Nonparametric (BNP) technique to automatically tune the number of latent components. Our empirical results show that, by integrating metadata, MPF/MIPF significantly outperform the state-of-the-art PF models for sparse and large datasets. MIPF also effectively estimates the number of latent components.
Hu, L, Jian, S, Cao, L & Chen, Q 2018, 'Interpretable recommendation via attraction modeling: Learning multilevel attractiveness over multimodal movie contents', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, IJCAI, Stockholm, Sweden, pp. 3400-3406.
New content like blogs and online videos is produced every second in the new media age. We argue that attraction is one of the decisive factors in users' selection of new content. However, collaborative filtering cannot work without user feedback, and existing content-based recommender systems cannot capture and interpret the attractive points of new content. Accordingly, we propose attraction modeling to learn and interpret user attractiveness. Specifically, we build a multilevel attraction model (MLAM) over the content features of movies: the story (textual data) and cast members (categorical data). In particular, we design multilevel personal filters to calculate users' attractiveness on words, sentences and cast members at different levels. The experimental results show the superiority of MLAM over the state-of-the-art methods. In addition, a case study demonstrates the interpretability of MLAM by visualizing user attractiveness on a movie.
Jian, S, Hu, L, Cao, L & Lu, K 2018, 'Metric-based auto-instructor for learning mixed data representation', 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, AAAI Conference on Artificial Intelligence, AAAI, New Orleans, USA, pp. 3318-3325.
Mixed data with both categorical and continuous features are ubiquitous in real-world applications. Learning a good representation of mixed data is critical yet challenging for further learning tasks. Existing methods for representing mixed data often overlook the heterogeneous coupling relationships between categorical and continuous features as well as the discrimination between objects. To address these issues, we propose an auto-instructive representation learning scheme to enable margin-enhanced distance metric learning for a discrimination-enhanced representation. Accordingly, we design a metric-based auto-instructor (MAI) model which consists of two collaborative instructors. Each instructor captures the feature-level couplings in mixed data with fully connected networks, and guides the infinite-margin metric learning for the peer instructor with a contrastive order. By feeding the learned representation into both partition-based and density-based clustering methods, our experiments on eight UCI datasets show highly significant learning performance improvement and much more distinguishable visualization outcomes over the baseline methods.
Wang, L, Bao, X & Cao, L 2018, 'Interactive probabilistic post-mining of user-preferred spatial co-location patterns', Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018, Paris, France, pp. 1260-1263.
© 2018 IEEE. Spatial co-location pattern mining is an important task in spatial data mining. However, traditional mining frameworks often produce too many prevalent patterns, of which only a small proportion may be truly interesting to end users. To satisfy user preferences, this work proposes an interactive probabilistic post-mining method to discover user-preferred co-location patterns from early rounds of mined results by iteratively involving the user's feedback and probabilistically refining preferred patterns. We first introduce a framework for interactively post-mining preferred co-location patterns, which enables a user to effectively discover the co-location patterns tailored to his/her specific preference. A probabilistic model is further introduced to measure the user feedback-based subjective preferences on the resulting co-location patterns. This measure is used not only to select sample co-location patterns in the iterative user feedback process but also to rank the results. The experimental results on real and synthetic data sets demonstrate the effectiveness of our approach.
Xu, J & Cao, L 2018, 'Vine copula-based asymmetry and tail dependence modeling', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Pacific-Asia Conference on Knowledge Discovery and Data Mining, Melbourne, VIC, Australia, pp. 285-297.
© Springer International Publishing AG, part of Springer Nature 2018. Financial variables such as asset returns in the massive market contain various hierarchical and horizontal relationships that form complicated dependence structures. Modeling these structures is challenging due to the stylized facts of market data. Much research in recent decades has shown that copula is an effective method for describing relations among variables. Vine structures were introduced to represent the decomposition of multivariate copula functions. However, the construction of vine structures is still a tough problem owing to the geometrical data, conditional independence assumptions and the stylized facts. In this paper, we introduce a new bottom-up method to construct regular vine structures and apply the model to 12 currencies over 16 years as a case study to analyze asymmetric and fat-tail features. The out-of-sample performance of our model is evaluated by Value at Risk, a widely used industrial benchmark. The experimental results show that our model and its intrinsic design significantly outperform industry baselines, and provide financially interpretable knowledge and profound insights into the dependence structures of multiple variables with complex dependencies and characteristics.
Xu, J, Wei, W & Cao, L 2017, 'Copula-based high dimensional cross-market dependence modeling', Proceedings - 2017 International Conference on Data Science and Advanced Analytics, DSAA 2017, IEEE International Conference on Data Science and Advanced Analytics, IEEE, Tokyo, Japan, pp. 734-743.
© 2017 IEEE. Dependence across multiple financial markets, such as stock and foreign exchange rate markets, is high-dimensional, contains various relationships, and often presents complicated dependence structures and characteristics such as asymmetrical dependence. Modeling such dependence structures is very challenging. Although copula has been demonstrated to be effective in describing dependence between variables in recent studies, building effective dependence structures to address the above complexities significantly challenges existing copula models. In this paper, we propose a new D vine-based model with a bottom-up strategy to construct high-dimensional dependence structures. The new modeling outcomes are applied to trade 15 stock market indices and 10 currency rates over 16 years as a case study. Extensive experimental results show that this model and its intrinsic design significantly outperform typical models and industry baselines, as shown by the log-likelihood and Vuong test, and Value at Risk - a widely used industrial benchmark. Our model provides interpretable knowledge and profound insights into the high-dimensional dependence structures across data sources.
Zhang, L, Cao, L, Luo, S, Gu, L, Chen, Y & Lian, Y 2018, 'Coupled collective matrix factorization', Proceedings - 2018 IEEE SmartWorld, Ubiquitous Intelligence and Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People and Smart City Innovations, SmartWorld/UIC/ATC/ScalCom/CBDCom/IoP/SCI 2018, IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation, Guangzhou, China, pp. 1023-1030.
© 2018 IEEE. Collective Matrix Factorization (CMF) makes rating prediction by jointly factorizing multiple matrices in recommender systems (RS), which also provides a unified view of matrix factorization. However, CMF does not directly involve the user attributes and item attributes that represent the intrinsic characteristics of users and items, so it fails to capture the coupling relationships within and between entities, such as users and items, which represent low-level data characteristics and complexities and drive the rating dynamics. In this work, we propose a coupled CMF (CCMF), which not only accommodates entity attributes into rating prediction, but also incorporates the couplings within and between entities into CMF. Therefore, CCMF not only captures the latent variable-based relationships between ratings and specific dimensions at high levels, but also captures the underlying driving forces, i.e., the hierarchical couplings within and between entities representing the low-level data characteristics and complexities. This work also presents a unified framework of CCMF in RS. Experimental results on two real data sets show that our proposed model outperforms the MF-based approaches.
Zhang, Q, Cao, L, Zhu, C, Li, Z & Sun, J 2018, 'CoupledCF: Learning explicit and implicit user-item couplings in recommendation for deep collaborative filtering', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press, Stockholm, Sweden, pp. 3662-3668.
© 2018 International Joint Conferences on Artificial Intelligence. All rights reserved. Non-IID recommender systems disclose the nature of recommendation and have shown their potential in improving recommendation quality and addressing issues such as sparsity and cold start. Existing work usually treats users/items as independent while ignoring the rich couplings within and between users and items, leading to limited performance improvement. In reality, users/items are related through various couplings existing within and between users and items, which may better explain how and why a user has a personalized preference for an item. This work builds on non-IID learning to propose a neural user-item coupling learning method for collaborative filtering, called CoupledCF. CoupledCF jointly learns explicit and implicit couplings within/between users and items w.r.t. user/item attributes and deep features for deep CF recommendation. Empirical results on two real-world large datasets show that CoupledCF significantly outperforms two latest neural recommenders: neural matrix factorization and Google's Wide&Deep network.
Lian, D, Zheng, K, Cao, L, Zheng, VW, Tsang, IW, Ge, Y & Xie, X 2018, 'High-order proximity preserving information network hashing', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, International Conference on Knowledge Discovery & Data Mining, ACM, London, United Kingdom, pp. 1744-1753.
© 2018 Association for Computing Machinery. Information network embedding is an effective way to enable efficient graph analytics. However, it still faces computational challenges in problems such as link prediction and node recommendation, particularly with the increasing scale of networks. Hashing is a promising approach for accelerating these problems by orders of magnitude. However, no prior studies have focused on seeking binary codes for information networks that preserve high-order proximity. Since matrix factorization (MF) unifies and outperforms several well-known embedding methods with high-order proximity preserved, we propose an MF-based Information Network Hashing (INH-MF) algorithm to learn binary codes which can preserve high-order proximity. We also suggest Hamming subspace learning, which only updates partial binary codes each time, to scale up INH-MF. We finally evaluate INH-MF on four real-world information network datasets with respect to the tasks of node classification and node recommendation. The results demonstrate that INH-MF performs significantly better than competing learning-to-hash baselines in both tasks, and surprisingly outperforms network embedding methods, including DeepWalk, LINE and NetMF, in the task of node recommendation. The source code of INH-MF is available online.
Wang, S, Hu, L, Cao, L, Huang, X, Lian, D & Liu, W 2018, 'Attention-based transactional context embedding for next-item recommendation', Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, New Orleans, United States, pp. 2532-2539.
Recommending the next item to a user in a transactional context is practical yet challenging in applications such as marketing campaigns. Transactional context refers to the items that are observable in a transaction. Most existing transaction-based recommender systems (TBRSs) make recommendations by mainly considering recently occurring items instead of all the ones observed in the current context. Moreover, they often assume a rigid order between items within a transaction, which is not always practical. More importantly, a long transaction often contains many items irrelevant to the next choice, which tend to overwhelm the influence of the few truly relevant ones. Therefore, we posit that a good TBRS should not only consider all the observed items in the current transaction but also weight them with different relevance to build an attentive context that outputs the proper next item with a high probability. To this end, we design an effective attention-based transaction embedding model (ATEM) for context embedding that weights each observed item in a transaction without assuming order. The empirical study on real-world transaction datasets proves that ATEM significantly outperforms the state-of-the-art methods in terms of both accuracy and novelty.
Pang, G, Cao, L, Chen, L, Lian, D & Liu, H 2018, 'Sparse Modeling-based Sequential Ensemble Learning for Effective Outlier Detection in High-dimensional Numeric Data.', The Thirty-Second AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, New Orleans, USA, pp. 3892-3899.
The large proportion of irrelevant or noisy features in real-life high-dimensional data presents a significant challenge to subspace/feature selection-based high-dimensional outlier detection (a.k.a. outlier scoring) methods. These methods often perform the two dependent tasks, relevant feature subset search and outlier scoring, independently, consequently retaining features/subspaces irrelevant to the scoring method and downgrading the detection performance. This paper introduces a novel sequential ensemble-based framework, SEMSE, and its instance, CINFO, to address this issue. SEMSE learns sequential ensembles to mutually refine feature selection and outlier scoring by iterative sparse modeling with outlier scores as the pseudo target feature. CINFO instantiates SEMSE by using three successive recurrent components to build such sequential ensembles. Given the outlier scores output by an existing outlier scoring method on a feature subset, CINFO first defines a Cantelli's inequality-based outlier thresholding function to select outlier candidates with a false positive upper bound. It then performs lasso-based sparse regression on the outlier candidate set, treating the outlier scores as the target feature and the original features as predictors, to obtain a feature subset that is tailored for the outlier scoring method. Our experiments show that two different outlier scoring methods enabled by CINFO (i) perform significantly better on 11 real-life high-dimensional data sets, and (ii) have much better resilience to noisy features, compared to their bare versions and three state-of-the-art competitors. The source code of CINFO is available at https://sites.google.com/site/gspangsite/sourcecode.
Pang, G, Chen, L, Cao, L & Liu, H 2018, 'Learning representations of ultrahigh-dimensional data for random distance-based outlier detection', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, London, United Kingdom, pp. 2041-2050.
© 2018 Association for Computing Machinery. Learning expressive low-dimensional representations of ultrahigh-dimensional data, e.g., data with thousands/millions of features, has been a major way to enable learning methods to address the curse of dimensionality. However, existing unsupervised representation learning methods mainly focus on preserving the data regularity information and learning the representations independently of subsequent outlier detection methods, which can result in suboptimal and unstable performance of detecting irregularities (i.e., outliers). This paper introduces a ranking model-based framework, called RAMODO, to address this issue. RAMODO unifies representation learning and outlier detection to learn low-dimensional representations that are tailored for a state-of-the-art outlier detection approach - the random distance-based approach. This customized learning yields more optimal and stable representations for the targeted outlier detectors. Additionally, RAMODO can leverage little labeled data as prior knowledge to learn more expressive and application-relevant representations. We instantiate RAMODO to an efficient method called REPEN to demonstrate the performance of RAMODO. Extensive empirical results on eight real-world ultrahigh dimensional data sets show that REPEN (i) enables a random distance-based detector to obtain significantly better AUC performance and two orders of magnitude speedup; (ii) performs substantially better and more stably than four state-of-the-art representation learning methods; and (iii) leverages less than 1% labeled data to achieve up to 32% AUC improvement.
Jian, S, Cao, L, Pang, G, Lu, K & Gao, H 2017, 'Embedding-based representation of categorical data by hierarchical value coupling learning', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Melbourne, Australia, pp. 1937-1943.
Learning the representation of categorical data with hierarchical value coupling relationships is very challenging but critical for the effective analysis and learning of such data. This paper proposes a novel coupled unsupervised categorical data representation (CURE) framework and its instantiation, a coupled data embedding (CDE) method, for representing categorical data by hierarchical value-to-value cluster coupling learning. Unlike existing embedding- and similarity-based representation methods, which can capture only a part or none of these complex couplings, CDE explicitly incorporates the hierarchical couplings into its embedding representation. CDE first learns two complementary feature value couplings which are then used to cluster values at different granularities. It further models the couplings between value clusters within the same granularity and across different granularities to embed feature values into a new numerical space with independent dimensions. Substantial experiments show that CDE significantly outperforms three popular unsupervised embedding methods and three state-of-the-art similarity-based representation methods.
Kim, J, Shim, K, Cao, L, Lee, JG, Lin, X & Moon, YS 2017, 'Advances in Knowledge Discovery and Data Mining', Lecture Notes in Computer Science, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer Verlag, Jeju, South Korea.
Kim, J, Shim, K, Cao, L, Lee, JG, Lin, X & Moon, YS 2017, 'PC Chairs' Preface', Advances in Knowledge Discovery and Data Mining (LNAI), Pacific-Asia Conference on Knowledge Discovery and Data Mining, Jeju, South Korea, pp. i-i.
It is our great pleasure to introduce the proceedings of the 21st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2017).
We received a record-breaking number of 458 submissions from 36 countries all over the world. This highest-ever number of submissions is very encouraging because it reflects the rising status of PAKDD. To rigorously review the submissions, we conducted a double-blind review following the tradition of PAKDD and constructed the largest-ever committee, consisting of 38 Senior Program Committee (SPC) members and 196 Program Committee (PC) members. Each valid submission was reviewed by three PC members and meta-reviewed by one SPC member, who also led the discussion with the PC members. We, the PC co-chairs, considered the recommendations from the SPC members and looked into each submission as well as its reviews to make the final decisions. Borderline papers were thoroughly discussed by us before the final decisions were made.
As a result, 129 out of 458 papers were accepted, yielding an acceptance rate of 28.2%. Among them, 45 papers were selected as long-presentation papers, and 84 papers were selected as regular-presentation papers. Mining social networks or graph data was the most popular topic among the accepted papers. The review process was supported by the Microsoft CMT system. During the three main conference days, these 129 papers were presented in 23 research sessions. A long-presentation paper was given 25 minutes for presentation, and a regular-presentation paper was given 15 minutes for presentation. These two types of papers, however, are not distinguished in the proceedings.
We would like to thank all SPC members, PC members, and external reviewers for their hard work in providing us with thoughtful and comprehensive reviews and recommendations. Also, we would like to express our sincere thanks to Yang-Sae Moon for compiling all accepted papers and for working with the Springer team to produce the proceedings.
We hope that ...
Lian, D, Liu, R, Ge, Y, Zheng, K, Xie, X & Cao, L 2017, 'Discrete Content-aware Matrix Factorization', KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), ACM, Halifax, Canada, pp. 325-334.
Pang, G, Xu, H, Cao, L & Zhao, W 2017, 'Selective value coupling learning for detecting outliers in high-dimensional categorical data', International Conference on Information and Knowledge Management, Proceedings, Conference on Information and Knowledge Management, Singapore, Singapore, pp. 807-816.
© 2017 Association for Computing Machinery. This paper introduces a novel framework, namely SelectVC, and its instance, POP, for learning selective value couplings (i.e., interactions between the full value set and a set of outlying values) to identify outliers in high-dimensional categorical data. Existing outlier detection methods work on a full data space or feature subspaces that are identified independently from subsequent outlier scoring. As a result, they are significantly challenged by overwhelming irrelevant features in high-dimensional data, due to the noise brought by the irrelevant features and the huge search space. In contrast, SelectVC works on a clean and condensed data space spanned by selective value couplings by jointly optimizing outlying value selection and value outlierness scoring. Its instance POP defines a value outlierness scoring function by modeling a partial outlierness propagation process to capture the selective value couplings. POP further defines a top-k outlying value selection method to ensure its scalability to the huge search space. We show that POP (i) significantly outperforms five state-of-the-art full space or subspace-based outlier detectors and their combinations with three feature selection methods on 12 real-world high-dimensional data sets with different levels of irrelevant features; and (ii) obtains good scalability, stable performance w.r.t. k, and a fast convergence rate.
Shi, Y, Li, W, Gao, Y, Cao, L & Shen, D 2017, 'Beyond IID: Learning to combine Non-IID metrics for vision tasks', 31st AAAI Conference on Artificial Intelligence, AAAI 2017, AAAI Conference on Artificial Intelligence, AAAI, San Francisco, USA, pp. 1524-1531.
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Metric learning has been widely employed, especially in various computer vision tasks, with the fundamental assumption that all samples (e.g., regions/superpixels in images/videos) are independent and identically distributed (IID). However, since the samples are usually spatially connected or temporally correlated with their physically connected neighbours, they are not IID (non-IID for short), which cannot be directly handled by existing methods. Thus, we propose to learn and integrate non-IID metrics (NIME). To incorporate the non-IID spatial/temporal relations, instead of directly using non-IID features and metric learning as in previous methods, NIME first builds several non-IID representations on the original (non-IID) features by various graph kernel functions, and then automatically learns the metric under the best combination of the various non-IID representations. NIME is applied to solve two typical computer vision tasks: interactive image segmentation and histology image identification. The results show that learning and integrating non-IID metrics improves the performance, compared to the IID methods. Moreover, our method achieves results comparable to or better than those of the state-of-the-art methods.
Wang, S, Hu, L & Cao, L 2017, 'Perceiving the Next Choice with Comprehensive Transaction Embeddings for Online Recommendation', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer Link, Skopje, Macedonia, pp. 285-302.
© 2017, Springer International Publishing AG. Predicting a customer's next choice in the context of what he/she has bought in a session is interesting and critical in the transaction domain, especially for online shopping. Precise prediction leads to high-quality recommendations and thus high benefit. Such recommendation is usually formalized as transaction-based recommender systems (TBRS). Existing TBRS either tend to recommend popular items while ignoring infrequent and newly released ones (e.g., pattern-based RS) or assume a rigid order between items within a transaction (e.g., Markov chain-based RS), which does not satisfy real-world cases most of the time. In this paper, we propose a neural network-based comprehensive transaction embedding model (NTEM) which can effectively perceive the next choice in a transaction context. Specifically, we learn comprehensive embeddings of both items and their features from relaxedly ordered transactions. The relevance between items revealed by the transactions is encoded into such embeddings. With rich information embedded, such embeddings are powerful for predicting the next choice given the already bought items. NTEM is a shallow wide-in-wide-out network, which is more efficient than deep networks considering the large numbers of items and transactions. Experimental results on real-world datasets show that NTEM outperforms three typical TBRS models, FPMC, PRME and GRU4Rec, in terms of recommendation accuracy and novelty. Our implementation is available at https://github.com/shoujin88/NTEM-model.
Hu, L, Cao, L, Wang, S, Xu, G, Cao, J & Gu, Z 2017, 'Diversifying personalized recommendation with user-session context', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conferences on Artifical Intelligence, International Joint Conferences on Artificial Intelligence Organization, Melbourne, Australia, pp. 1858-1864.
Recommender systems (RS) have become an integral part of our daily life. However, most current RS often repeatedly recommend items to users with similar profiles. We argue that recommendation should be diversified by leveraging session contexts with personalized user profiles. However, current session-based RS (SBRS) often assume a rigidly ordered sequence over data, which does not fit many real-world cases. Moreover, personalization is often omitted in current SBRS. Accordingly, a personalized SBRS over relaxedly ordered user-session contexts is more pragmatic. In doing so, deep-structured models tend to be too complex to serve online SBRS owing to the large number of users and items. Therefore, we design an efficient SBRS with shallow wide-in-wide-out networks, inspired by the successful experience in modern language modeling. The experiments on a real-world e-commerce dataset show the superiority of our model over the state-of-the-art methods.
Pang, G, Cao, L, Chen, L & Liu, H 2017, 'Learning homophily couplings from non-IID data for joint feature selection and noise-resilient outlier detection', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence Organization, Melbourne, Australia, pp. 2585-2591.
This paper introduces a novel wrapper-based outlier detection framework (WrapperOD) and its instance (HOUR) for identifying outliers in noisy data (i.e., data with noisy features) with strong couplings between outlying behaviors. Existing subspace or feature selection-based methods are significantly challenged by such data, as their search for feature subset(s) is independent of outlier scoring and thus can be misled by noisy features. In contrast, HOUR takes a wrapper approach to iteratively optimize the feature subset selection and outlier scoring using a top-κ outlier ranking evaluation measure as its objective function. HOUR learns homophily couplings between outlying behaviors (i.e., abnormal behaviors are not independent; they bond together) in constructing a noise-resilient outlier scoring function to produce a reliable outlier ranking in each iteration. We show that HOUR (i) retains a 2-approximation outlier ranking to the optimal one; and (ii) significantly outperforms five state-of-the-art competitors on 15 real-world data sets with different noise levels in terms of AUC and/or P@n. The source code of HOUR is available at https://sites.google.com/site/gspangsite/sourcecode.
Cao, L & Zhu, H 2016, 'Message from General Chairs', Proceedings - 2016 IEEE 2nd International Conference on Big Data Computing Service and Applications, BigDataService 2016, p. x.
Gaussier, E & Cao, L 2016, 'Conference Report on 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA'2015) [Conference Reports]', IEEE Computational Intelligence Magazine, pp. 13-14.
The 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA'2015) was held on 19-21 October 2015 in Paris, France. It was fully sponsored by the IEEE Computational Intelligence Society, in partnership with the ACM SIGKDD. DSAA was the only IEEE/ACM jointly sponsored conference devoted to data science, big data and advanced analytics, providing a premier forum for interdisciplinary research across statistics, data science, computational intelligence, and business, and for bridging the gaps between academia and industry.
Kumar, KD, Reddy, PK, Reddy, PB & Cao, L 2016, 'Improving the performance of collaborative filtering with category-specific neighborhood', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Asian Conference on Intelligent Information and Database Systems (ACIIDS), Springer, Da Nang, Vietnam, pp. 201-210.
© Springer-Verlag Berlin Heidelberg 2016. Recommender systems (RS) help customers select appropriate products from millions of products and have become a key component of e-commerce systems. Collaborative filtering (CF) based approaches are widely employed to build RS. In CF, the recommendation for the target user is computed after forming the corresponding neighborhood of users. The neighborhood of a target user is extracted based on the similarity between the product rating vector of the target user and the product rating vectors of individual users. In CF, the methodology employed for neighborhood formation influences performance. In this paper, we make an effort to improve the performance of CF by proposing a different approach that computes recommendations by considering two kinds of neighborhood: one formed by treating the product ratings of a user as a single vector, and the other based on the neighborhood of the corresponding virtual users. For the target user, the virtual users are formed by dividing the ratings based on the category of products. We propose a combined approach to compute better recommendations by considering both kinds of neighborhoods. The experimental results on the real-world MovieLens dataset show that the proposed approach improves performance over CF.
Shao, J, Meng, X & Cao, L 2016, 'Mining Actionable Combined High Utility Incremental and Associated Patterns', 2016 IEEE/CSAA International Conference on Aircraft Utility Systems (AUS), IEEE/CSAA International Conference on Aircraft Utility Systems, IEEE, Beijing, China, pp. 1164-1169.
High Utility Itemset (HUI) mining, instead of Frequent Itemset Mining (FIM), has been an attractive theme in the data mining domain for over a decade, since it can be regarded as an alternative way for researchers to identify actionable patterns. In addition, the need for decision-making actions and behavior-oriented strategies based on large amounts of informative data has made the significance of discovering actionable patterns widely acknowledged. The current HUI mining research focus has been on improving efficiency to make algorithms faster and more stable. However, the coupling relationships between items in given itemsets are ignored. For example, the utility of an itemset might be lower than the manager expected until one additional item is included; conversely, the utility of an itemset might drop sharply when another item joins. Moreover, it is common to find that many redundant itemsets sharing the same underlying item are presented by existing academic HUI mining methods. Store managers would not make the expected profits based on such results, which makes the results not actionable at all. To this end, we introduce a new framework for mining actionable patterns, called Mining Utility Associated Patterns (MUAP), which aims to find high-utility, incremental and strongly associated items/itemsets with combined incorporating criteria. The outputs of this algorithm are convincing on real datasets as well as synthetic datasets.
Chen, Q, Hu, L, Xu, J, Liu, W & Cao, L 2015, 'Document Similarity Analysis via Involving Both Explicit and Implicit Semantic Couplings', Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015, International Conference on Data Science and Advanced Analytics, IEEE, Paris.
Wang, S, Liu, W, Wu, J, Cao, L, Meng, Q & Kennedy, PJ 2016, 'Training deep neural networks on imbalanced data sets', Proceedings of the International Joint Conference on Neural Networks, IEEE International Joint Conference on Neural Networks, IEEE, Vancouver, Canada, pp. 4368-4374.
Deep learning has become increasingly popular in both academia and industry in recent years. Domains including pattern recognition, computer vision, and natural language processing have witnessed the great power of deep networks. However, current studies on deep learning mainly focus on data sets with balanced class labels, and its performance on imbalanced data is not well examined. Imbalanced data sets are widespread in the real world and pose great challenges for classification tasks. In this paper, we focus on classifying imbalanced data sets with deep networks. Specifically, a novel loss function called mean false error, together with its improved version mean squared false error, is proposed for training deep networks on imbalanced data sets. The proposed method captures classification errors from the majority class and the minority class equally. Experiments and comparisons demonstrate the superiority of the proposed approach over conventional methods in classifying imbalanced data sets with deep neural networks.
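A hedged sketch of the core idea: average the squared error within each class first, then across classes, so the minority class carries equal weight. The paper defines mean false error over network outputs per training batch; this plain-Python version is only illustrative.

```python
def mse(y_true, y_pred):
    # Conventional mean squared error over all examples.
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def mean_false_error(y_true, y_pred):
    # Average the squared error within each class first, then across classes,
    # so each class contributes equally regardless of its size.
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        per_class.append(sum((y_true[i] - y_pred[i]) ** 2 for i in idx) / len(idx))
    return sum(per_class) / len(per_class)

# A degenerate "predict the majority" model: plain MSE looks small (0.2),
# but mean false error exposes the total failure on the minority class (0.5).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0.0] * 10
```

In a deep network, this quantity (or its squared variant) would replace the usual per-batch loss during backpropagation.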
Pang, G, Cao, L & Chen, L 2016, 'Outlier detection in complex categorical data by modelling the feature value couplings', Proceedings of the 25th International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence (IJCAI), AAAI Press, New York.
Pang, G, Cao, L, Chen, L & Liu, H 2016, 'Unsupervised Feature Selection for Outlier Detection by Modelling Hierarchical Value-Feature Couplings', Proceedings - IEEE International Conference on Data Mining, ICDM, IEEE International Conference on Data Mining, IEEE, Barcelona.
Fan, X, Xu, RYD & Cao, L 2016, 'Copula mixed-membership stochastic block model', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press / International Joint Conferences on Artificial Intelligence, New York City, New York, United States, pp. 1462-1468.
The Mixed-Membership Stochastic Blockmodels (MMSB) is a popular framework for modelling social relationships by fully exploiting each individual node's participation (or membership) in a social network. Despite its powerful representations, MMSB assumes that the membership indicators of each pair of nodes (i.e., people) are distributed independently. However, such an assumption often does not hold in real-life social networks, in which certain known groups of people may correlate with each other in terms of factors such as their membership categories. To expand MMSB's ability to model such dependent relationships, a new framework - a Copula Mixed-Membership Stochastic Blockmodel - is introduced in this paper for modeling intra-group correlations, namely an individual Copula function jointly models the membership pairs of those nodes within the group of interest. This framework enables various Copula functions to be used on demand, while maintaining the membership indicator's marginal distribution needed for modelling membership indicators with other nodes outside of the group of interest. Sampling algorithms for both the finite and infinite number of groups are also detailed. Our experimental results show its superior performance in capturing group interactions when compared with the baseline models on both synthetic and real world datasets.
Cao, L, Zhang, C, Joachims, T, Webb, G, Margineantu, D, Williams, G, Parekh, R, Fayyad, U, Eliassi-Rad, T, Fürnkranz, J, Pei, J, Zhou, ZH, Bekkerman, R & Tang, J 2015, 'Foreword', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. iii-iv.
Cao, W, Demazeau, Y, Cao, L & Zhu, W 2015, 'Financial crisis and global market couplings', Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015, IEEE International Conference on Data Science and Advanced Analytics, IEEE, France.
The global financial crisis that began in 2007, and its severely damaging consequences for other global financial markets, show the great importance of understanding the impact of, and contagion between, different financial markets. A variety of methods have been proposed and implemented for market contagion. However, most of the existing literature simply tests for the existence of market contagion during a financial crisis; little work goes deeper to investigate the complex market couplings that are the essence of contagion. This is indeed very difficult, as it involves selecting discriminative indicators, handling different types of couplings (intra-market and inter-market), uncovering the hidden characteristics of couplings, and evaluating market couplings in understanding a crisis. To address these issues, this paper proposes a CHMM-LR framework to investigate the relations between financial crisis and three pairwise market couplings across three typical global financial markets: the Equity, Commodity and Interest markets. We adopt a Coupled Hidden Markov Model (CHMM) to capture the complex hidden pairwise market couplings, and the crisis-forecasting abilities based on different pairwise market couplings are used to measure these relations via Logistic Regression (LR). Experiments on real financial data from 1990 to 2010 show the advantages of market couplings in understanding crises. In addition, the experimental results provide a crucial interpretation for identifying the 2008 global financial crisis periods.
Gao, J, Yu, S & Cao, L 2015, 'Message from the IEEE BigDataService 2015 program chairs', Proceedings - 2015 IEEE 1st International Conference on Big Data Computing Service and Applications, BigDataService 2015, p. xii.
Gaussier, E, Cao, L, Cappé, O, Wang, W, Gallinari, P, Kwok, J, Pasi, G & Zaiane, O 2015, 'Welcome from IEEE DSAA'2015 chairs', Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015, pp. III-IV.
Liu, C & Cao, L 2015, 'A coupled k-nearest neighbor algorithm for multi-label classification', Advances in Knowledge Discovery and Data Mining - LNCS, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Ho Chi Minh City, Vietnam, pp. 176-187.
ML-kNN is a well-known algorithm for multi-label classification. Although effective in some cases, it has a drawback: it is a binary relevance classifier that considers only one label at a time. In this paper, we present a new method for multi-label classification, based on lazy learning, which classifies an unseen instance from its k nearest neighbours. By introducing a coupled similarity between class labels, the proposed method exploits the correlations between class labels, overcoming the shortcoming of ML-kNN. Experiments on benchmark data sets show that our proposed Coupled Multi-Label k Nearest Neighbour algorithm (CML-kNN) outperforms existing multi-label classification algorithms.
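To make the label-coupling idea concrete, here is a small sketch, not the paper's exact CML-kNN formulation: label-label similarity is estimated from label co-occurrence in the training set (a Jaccard-style measure of my own choosing), and each candidate label is scored by direct neighbour votes plus coupled votes from co-occurring labels.

```python
from collections import Counter
from math import sqrt

def label_cooccurrence_similarity(label_sets):
    # Coupled similarity between labels, from how often they co-occur
    # in the same training instance (Jaccard over instance counts).
    counts = Counter()
    pair_counts = Counter()
    for labels in label_sets:
        for a in labels:
            counts[a] += 1
        for a in labels:
            for b in labels:
                if a < b:
                    pair_counts[(a, b)] += 1
    sim = {}
    for (a, b), n in pair_counts.items():
        sim[(a, b)] = sim[(b, a)] = n / (counts[a] + counts[b] - n)
    return sim

def coupled_knn_predict(query, train, k=3, threshold=0.5):
    # train: list of (feature_vector, label_set). Score each label by direct
    # votes from the k nearest neighbours plus coupled votes from co-occurring labels.
    def dist(u, v):
        return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    neighbours = sorted(train, key=lambda t: dist(query, t[0]))[:k]
    sim = label_cooccurrence_similarity([labels for _, labels in train])
    all_labels = {l for _, labels in train for l in labels}
    scores = {}
    for l in all_labels:
        direct = sum(1 for _, labels in neighbours if l in labels)
        coupled = sum(sim.get((l, m), 0.0)
                      for _, labels in neighbours for m in labels if m != l)
        scores[l] = (direct + coupled) / k
    return {l for l, s in scores.items() if s >= threshold}
```

A label that never appears among the neighbours can still receive score through labels it frequently co-occurs with, which is the correlation ML-kNN ignores.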
Wei, C, Hu, L & Cao, L 2015, 'Deep Modeling Complex Couplings within Financial Markets', AAAI'15 Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Austin, Texas, USA, pp. 2518-2524.
The global financial crisis occurred in 2008, and its contagion to other regions, as well as its long-lasting impact on different markets, shows that it is increasingly important to understand the complicated coupling relationships across financial markets. This is very difficult, as complex hidden coupling relationships exist between financial markets in various countries and are very hard to model. The couplings involve interactions between homogeneous markets in different countries (which we call intra-market coupling), interactions between heterogeneous markets (inter-market coupling), and interactions between current and past market behaviours (temporal coupling). Very limited work has been done on modelling such complex couplings; some existing methods predict market movement by simply aggregating indicators from various markets while ignoring the inbuilt couplings. As a result, these methods are highly sensitive to observations and may fail when financial indicators change only slightly. In this paper, a coupled deep belief network is designed to accommodate the above three types of couplings across financial markets. With a deep-architecture model capturing the high-level coupled features, the proposed approach can infer market trends. Experimental results on stock and currency market data from three countries show that our approach outperforms other baselines, from both technical and business perspectives.
Yu, PS, Cao, L, Zeng, Y, An, B, Symeonidis, AL, Gorodetsky, V & Coenen, F 2015, 'Message from the workshop chairs - 2014 International Workshop on Agents and Data Mining Interaction (ADMI 2014)', Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science), 2014 International Workshop on Agents and Data Mining Interaction (ADMI 2014), Springer, Paris, France, pp. v-vi.
We are pleased to welcome you to the proceedings of the 2014 International Workshop on Agents and Data Mining Interaction (ADMI 2014), held jointly with AAMAS 2014. In recent years, agents and data mining interaction (ADMI, or agent mining) has emerged as a very promising research field. Following the success of previous ADMIs, ADMI 2014 provided a premier forum for sharing research and engineering results, as well as potential challenges and prospects encountered in the coupling between agents and data mining.
Yue, X, Cao, L, Chen, Y & Xu, B 2015, 'Multi-View Actionable Patterns for Managing Traffic Bottleneck', Artificial Intelligence for Transportation: Advice, Interactivity and Actor Modeling: Papers from the 2015 AAAI Workshop, AAAI Conference on Artificial Intelligence, AAAI Press, Austin, Texas, USA, pp. 64-70.
Discovering congestion patterns from table-formed traffic reports is critical for traffic bottleneck analysis. However, patterns mined by existing algorithms often do not satisfy user requirements and are not actionable for traffic management. Traffic officers may not pursue the most frequent patterns, but instead expect mining outcomes that show the dependence between congestion and various kinds of road properties for traffic planning. Such multi-view analysis requires integrating user preferences over data attributes into the pattern mining process. To tackle this problem, we propose a multi-view attribute reduction model for discovering patterns of user interest, in which user views are interpreted as preferred attributes and formulated by attribute orders. Based on this pattern discovery model, a workflow is built for traffic bottleneck analysis, consisting of data preprocessing, preference representation and congestion pattern mining. Our approach is validated on road condition reports from Shanghai, which show that the resulting multi-view findings are effective for analysing congestion causes and for traffic management.
Zhou, X, Chen, L, Zhang, Y, Cao, L, Huang, G & Wang, C 2015, 'Online Video Recommendation in Sharing Community', Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ACM Special Interest Group on Management of Data Conference, ACM, Melbourne, Victoria, Australia, pp. 1645-1656.
The creation of sharing communities has resulted in an astonishing increase in digital videos and their wide application in domains such as entertainment and online news broadcasting. Improving these applications relies on effective solutions for social user access to video data, which has driven recent research interest in social recommendation in shared communities. Although some effort has been put into video recommendation in shared communities, the contextual information on social users has not been well exploited for effective recommendation. In this paper, we propose an approach based on the content and social information of videos for recommendation in sharing communities. Specifically, we first exploit a robust video cuboid signature together with the Earth Mover's Distance to capture the content relevance of videos. Then, we propose to identify the social relevance of clips using the set of users associated with a video. We fuse the content relevance and social relevance to identify relevant videos for recommendation. Following that, we propose a novel scheme called sub-community-based approximation, together with a hash-based optimization, to improve the efficiency of our solution. Finally, we propose an algorithm for efficiently maintaining social updates in dynamic shared communities. Extensive experiments demonstrate the high effectiveness and efficiency of the proposed video recommendation approach.
Jiang, X, Liu, W, Cao, L & Long, G 2015, 'Coupled Collaborative Filtering for Context-aware Recommendation', AAAI Publications, Twenty-Ninth AAAI Conference on Artificial Intelligence, Student Abstracts, AAAI Conference on Artificial Intelligence, AAAI, Austin Texas, USA, pp. 4172-4173.
Context-aware features have been widely recognized as important factors in recommender systems. However, traditional Collaborative Filtering (CF), as a major technique in recommender systems, does not provide a straightforward way to integrate context-aware information into personal recommendation. We propose a Coupled Collaborative Filtering (CCF) model to measure the contextual information and use it to improve recommendations. In the proposed approach, coupled similarity is computed from inter-item, intra-context and inter-context interactions among item, user and context-aware factors. Experiments based on different types of CF models demonstrate the effectiveness of our design.
Shao, J, Yin, J, Liu, W & Cao, L 2015, 'Actionable Combined High Utility Itemset Mining', AAAI'15 Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Austin, Texas, USA, pp. 4206-4207.
Shao, J, Yin, J, Liu, W & Cao, L 2015, 'Mining Actionable Combined Patterns of High Utility and Frequency', Proceedings of the IEEE International Conference on Data Science and Advanced Analytics, IEEE International Conference on Data Science and Advanced Analytics, IEEE, Paris, pp. 1-10.
In recent years, the importance of identifying actionable patterns has become increasingly recognized, so that decision-support actions can be inspired by the resulting patterns. A typical shift is towards identifying high utility rather than highly frequent patterns. Accordingly, High Utility Itemset (HUI) mining methods have become popular, as well as faster and more reliable than before. However, the current research focus has been on improving efficiency, while the coupling relationships between items are ignored. It is important to study the item and itemset couplings inbuilt in the data. For example, the utility of one itemset might be lower than a user-specified threshold until an additional itemset joins it; conversely, an item's utility might be high until another joins. Even though some absolutely high utility itemsets can be discovered, it is easy to find that many redundant itemsets sharing the same item are mined (e.g., if the utility of a diamond is high enough, all of its supersets are HUIs). Such itemsets are not actionable, and sellers cannot make higher profits if marketing strategies are built on such findings. To this end, we introduce a new framework for mining actionable high utility association rules, called Combined Utility-Association Rules (CUAR), which finds itemset combinations with high utility and strong association by incorporating item/itemset relations. Experimental outcomes on both real and synthetic datasets show that the algorithm is efficient.
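The combined utility-plus-association filter can be illustrated in miniature. This is a sketch under my own assumptions (the function names and the use of lift as the association measure are illustrative, not the CUAR algorithm): a 2-itemset survives only if its total utility clears a threshold and its items are genuinely associated, which prunes high-utility pairs that co-occur only by chance.

```python
from itertools import combinations

def utility(itemset, transactions, unit_profit):
    # Total utility of an itemset: over transactions containing all its items,
    # sum quantity * unit profit for each member item.
    total = 0
    for t in transactions:  # t maps item -> purchased quantity
        if all(i in t for i in itemset):
            total += sum(t[i] * unit_profit[i] for i in itemset)
    return total

def combined_high_utility_rules(transactions, unit_profit, min_util, min_lift):
    # Keep 2-itemsets that are both high utility and strongly associated (lift),
    # filtering out combinations whose co-occurrence is coincidental.
    n = len(transactions)
    items = {i for t in transactions for i in t}
    results = []
    for a, b in combinations(sorted(items), 2):
        u = utility((a, b), transactions, unit_profit)
        supp_a = sum(1 for t in transactions if a in t) / n
        supp_b = sum(1 for t in transactions if b in t) / n
        supp_ab = sum(1 for t in transactions if a in t and b in t) / n
        lift = supp_ab / (supp_a * supp_b) if supp_a and supp_b else 0.0
        if u >= min_util and lift >= min_lift:
            results.append(((a, b), u, round(lift, 3)))
    return results
```

The lift condition is what removes the "diamond plus anything" redundancy: a diamond paired with an unrelated item has high utility but lift near or below 1.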
Fu, B, Xu, G, Cao, L, Wang, Z & Wu, Z 2015, 'Coupling multiple views of relations for recommendation', Advances in Knowledge Discovery and Data Mining - LNCS, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Ho Chi Minh City, Vietnam, pp. 732-743.
Learning user/item relations is a key issue in recommender systems, and existing methods mostly measure these relations from one particular aspect, e.g., historical ratings. However, the relations between users/items can be influenced by multifaceted factors, so any single type of measure captures only a partial view of them. It is therefore advisable to integrate measures from different aspects to estimate the underlying user/item relations, and this estimation should be optimal for the current task. To this end, we propose a novel model to couple multiple relations measured on different aspects, and determine the optimal user/item relations by learning the optimal way of integrating these relation measures. Specifically, the matrix factorization model is extended by considering the relations between the latent factors of different users/items. Experiments show that our method performs well and outperforms other baseline methods.
Li, F, Xu, G & Cao, L 2015, 'Coupled Matrix Factorization within Non-IID Context', Proceedings, Part II, 19th Pacific-Asia Conference, PAKDD 2015, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Ho Chi Minh City, Vietnam, pp. 707-719.
Recommender systems research has moved through different stages, from understanding user preferences to content analysis. Typical recommendation algorithms are built on two bases: (1) assuming users and items are IID, namely independent and identically distributed, and (2) focusing on specific aspects such as user preferences or content. In reality, complex recommendation tasks demand (1) personalized outcomes tailored to heterogeneous subjective preferences, and (2) explicit and implicit objective coupling relationships between users, items and ratings, considered as intrinsic forces driving preferences. This inevitably involves non-IID complexity and the need to combine subjective preferences with the objective couplings hidden in recommendation applications. In this paper, we propose a novel generic coupled matrix factorization (CMF) model that incorporates non-IID coupling relations between users and items. These couplings integrate the intra-coupled interactions within an attribute and the inter-coupled interactions among different attributes. Experimental results on two open data sets demonstrate that user/item couplings can be effectively applied in RS and that CMF outperforms the benchmark methods.
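One simple way to fold item couplings into matrix factorization, shown here as an illustrative sketch rather than the CMF model itself (the coupling step, hyperparameter names and `item_sim` format are my own assumptions), is to add a step that pulls the latent factors of coupled items toward each other during training.

```python
import random

def coupled_mf(ratings, n_users, n_items, item_sim, k=2, lr=0.02,
               reg=0.1, coup=1.0, epochs=300, seed=0):
    """SGD matrix factorization plus a coupling step that pulls the latent
    factors of attribute-coupled items toward each other.
    item_sim: dict item -> list of (coupled item, coupling strength)."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:  # standard regularized SGD updates
            e = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (e * qi - reg * pu)
                Q[i][f] += lr * (e * pu - reg * qi)
        for i, neighbours in item_sim.items():  # coupling: move Q[i] toward coupled items
            for f in range(k):
                pull = sum(s * (Q[j][f] - Q[i][f]) for j, s in neighbours)
                Q[i][f] += lr * coup * pull
    return P, Q

def predict(P, Q, u, i):
    return sum(a * b for a, b in zip(P[u], Q[i]))
```

An item with no ratings at all can still inherit the factors of the items it is coupled with, which is one way such couplings help with sparsity.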
Deng, Z, Jiang, Y, Cao, L & Wang, S 2014, 'Knowledge-leverage based TSK fuzzy system with improved knowledge transfer', IEEE International Conference on Fuzzy Systems, IEEE International Conference on Fuzzy Systems, IEEE, Beijing, China, pp. 178-185.
In this study, an improved knowledge-leverage based TSK fuzzy system modeling method is proposed to overcome the weaknesses of the existing knowledge-leverage based TSK fuzzy system (TSK-FS) modeling method. In particular, two improved knowledge-leverage strategies are introduced for learning the antecedent and consequent parameters of the TSK-FS constructed in the current scene, by transfer learning from the reference scene. With these improved knowledge-leverage learning abilities, the proposed method shows a more adaptive modeling effect than traditional TSK fuzzy modeling methods and related methods on synthetic and real-world datasets.
Liu, C, Cao, L & Yu, PS 2014, 'A hybrid coupled k-nearest neighbor algorithm on imbalance data', 2014 International Joint Conference on Neural Networks (IJCNN), IEEE International Joint Conference on Neural Networks, IEEE, Beijing, China, pp. 2011-2018.
State-of-the-art classification algorithms rarely consider the relationships between the attributes in a data set and assume the attributes are independent of each other (IID). However, in real-world data, attributes more or less interact via explicit or implicit relationships. Although classifiers for class-balanced data are relatively well developed, classifying class-imbalanced data is not straightforward, especially for mixed-type data with both categorical and numerical features, on which limited research has been conducted. Some algorithms synthesize or remove instances to make the class sizes comparable, which may change the inherent data structure or introduce noise into the source data, while distance- or similarity-based algorithms ignore the relationships between features when computing similarity. This paper proposes a hybrid coupled k-nearest neighbour classification algorithm (HC-kNN) for mixed-type data: numerical features are discretized so that inter-coupling similarity can be applied to them as to categorical features, and this coupled similarity is then combined with the original similarity or distance, overcoming the shortcomings of previous algorithms. Experimental results demonstrate that the proposed algorithm achieves higher average performance than the relevant algorithms (e.g., variants of kNN, Decision Tree, SMOTE and NaiveBayes).
Liu, C, Cao, L & Yu, PS 2014, 'Coupled fuzzy k-nearest neighbors classification of imbalanced non-IID categorical data', 2014 International Joint Conference on Neural Networks (IJCNN), IEEE International Joint Conference on Neural Networks, IEEE, Beijing, China, pp. 1122-1129.
Mining imbalanced data has recently received increasing attention due to its challenges and wide applications in the real world. Most existing work focuses on numerical data, either manipulating the data structure, which essentially changes the data characteristics, or developing new distance or similarity measures designed for data under the so-called IID assumption, namely that data is independent and identically distributed. This is not consistent with real-life data and business needs, which require us to fully respect the data structure and the coupling relationships embedded in data objects, features and feature values. In this paper, we propose a novel coupled fuzzy similarity-based classification approach that caters for the difference between classes through a fuzzy membership, and for the couplings through coupled object similarity, and incorporate them into the most popular classifier, kNN, to form a coupled fuzzy kNN (i.e., CF-kNN). We test the approach on 14 categorical data sets against several kNN variants and classic classifiers including C4.5 and NaiveBayes. The experimental results show that CF-kNN outperforms the baselines, and that classifiers incorporating the proposed coupled fuzzy similarity perform better than their original editions.
Meng, X, Cao, L & Shao, J 2014, 'Semantic approximate keyword query based on keyword and query coupling relationship analysis', CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management, ACM International Conference on Information and Knowledge Management, ACM, Shanghai, China, pp. 529-538.
Due to imprecise query intention, Web database users often search for information with a limited number of keywords that are not directly related to their precise query. Semantic approximate keyword querying is challenging but helpful for specifying such query intent and providing more relevant answers. By extracting the semantic relationships both between keywords and between keyword queries, this paper proposes a new keyword query approach that generates semantic approximate answers by identifying a set of keyword queries from the query history whose semantics are related to the given keyword query. To capture the semantic relationships between keywords, a semantic coupling relationship analysis model is introduced that models both intra- and inter-keyword couplings. Building on these coupling relationships, the semantic similarity of different keyword queries is measured by a semantic matrix. Representative queries in the query history are identified, and an a priori order of the remaining queries corresponding to each representative query is created in an off-line preprocessing step. These representative queries and associated orders are then used to expeditiously generate the top-k semantically related keyword queries. We demonstrate that our coupling relationship analysis model can accurately capture the semantic relationships both between keywords and between queries, and that the top-k keyword query selection algorithm is efficient.
Wei, W, Yin, J, Li, J & Cao, L 2014, 'Modelling Asymmetry and Tail Dependence among Multiple Variables by Using Partial Regular Vine', Proceedings of the 2014 SIAM International Conference on Data Mining, SIAM International Conference on Data Mining, SIAM, Philadelphia, USA, pp. 776-784.
Modeling high-dimensional dependence is widely studied to explore deep relations among multiple variables, and is particularly useful for financial risk assessment. Very often, existing high-dimensional dependence models place strong restrictions on the dependence structure; these restrictions prevent the detection of sophisticated structures such as asymmetry and upper and lower tail dependence between multiple variables. This paper proposes a partial regular vine copula model to relax these restrictions. The new model employs partial correlation to construct the regular vine structure, which is algebraically independent. The model is also able to capture the asymmetric characteristics among multiple variables by using two-parameter copulas with flexible lower and upper tail dependence. Our method is tested on a cross-country stock market data set to analyse asymmetry and tail dependence. The high prediction performance is examined via Value at Risk, a commonly adopted evaluation measure in financial markets.
Yu, PS, Kitsuregawa, M, Motoda, H, Goethals, B, Guo, M, Cao, L, Karypis, G, King, I & Wang, W 2014, 'Welcome from DSAA 2014 chairs', DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics, pp. 9-10.
Li, M, Li, J, Ou, Y, Zhang, Y, Luo, D, Bahtia, M & Cao, L 2012, 'Coupled K-nearest centroid classification for non-iid data', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Transactions on Computational Collective Intelligence XV: International Conference on Practical Applications on Agents and Multi-Agent Systems, Springer Verlag, Salamanca, pp. 89-100.
Most traditional classification methods assume the independence and identical distribution (IID) of objects, attributes and values. However, real-world data, such as multi-agent data and behavioral data, usually contains strong couplings among values, attributes and objects, which greatly challenges existing methods and tools. This work targets coupling similarities from these three perspectives and designs a novel classification method that applies a weighted K-Nearest Centroid to obtain the coupled similarity for non-IID data. From the value and attribute perspectives, coupled similarity serves as a metric for nominal objects that considers not only the intra-coupled similarity within an attribute but also the inter-coupled similarity between attributes. From the object perspective, we propose a more effective method that measures the centroid object by connecting all related objects. Extensive experiments on UCI and student data sets reveal that the proposed method outperforms classical methods with higher accuracy, especially on imbalanced data.
Li, M, Li, J, Ou, Y, Zhang, Y, Luo, D, Bahtia, M & Cao, L 2014, 'Learning heterogeneous coupling relationships between non-IID terms', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), International Workshop on Agents and Data Mining Interaction, Springer, Saint Paul, MN, pp. 79-91.
With the rapid proliferation of social media and online communities, a vast amount of text data has been generated, and discovering the insights in this data has become increasingly important. A variety of text mining and processing algorithms have been created in recent years for tasks such as classification, clustering and similarity comparison. Most previous research uses a vector-space model for text representation and analysis. However, the vector-space model does not utilize information about term-to-term relationships, and classic classification methods likewise ignore the relationships between documents. In other words, traditional text mining techniques assume the relations between terms and between documents are independent and identically distributed (IID). In this paper, we introduce a novel term representation that involves the coupled relations between terms. This coupled representation provides much richer information, enabling us to create a coupled similarity metric for measuring document similarity; a coupled document similarity based K-Nearest Centroid classifier is then applied to the classification task. Experiments verify that the proposed approach outperforms the classic vector-space based classifier, and show its potential advantages and richness for other text mining tasks.
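A toy sketch of the coupled-term idea (illustrative only, not the paper's exact metric): term-term coupling is estimated from document-level co-occurrence, and document similarity then credits coupled terms rather than only exact matches, so two documents can be similar without sharing a single word.

```python
from collections import Counter

def term_coupling(docs):
    # Inter-term coupling from document-level co-occurrence (Jaccard over
    # document frequencies); docs is a list of token lists.
    df = Counter()
    co = Counter()
    for d in docs:
        terms = set(d)
        for t in terms:
            df[t] += 1
        for a in terms:
            for b in terms:
                if a < b:
                    co[(a, b)] += 1
    sim = {}
    for (a, b), n in co.items():
        sim[(a, b)] = sim[(b, a)] = n / (df[a] + df[b] - n)
    return sim

def coupled_doc_similarity(d1, d2, sim):
    # Each term pair contributes 1.0 for an exact match, otherwise its
    # coupling strength; averaged over all cross-document term pairs.
    t1, t2 = set(d1), set(d2)
    if not t1 or not t2:
        return 0.0
    score = sum(1.0 if a == b else sim.get((a, b), 0.0) for a in t1 for b in t2)
    return score / (len(t1) * len(t2))
```

Under a plain vector-space model, two documents with no shared terms score zero; here they can still score through coupled terms.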
Hu, L, Cao, J, Xu, G, Cao, L, Gu, Z & Cao, W 2014, 'Deep modeling of group preferences for group-based recommendation', Proceedings of the National Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AI Access Foundation, Quebec, Canada, pp. 1861-1867.
Nowadays, most recommender systems (RSs) aim to suggest appropriate items for individuals. Due to the social nature of human beings, group activities have become an integral part of daily life, motivating the study of group RSs (GRSs). However, most existing GRS methods make recommendations by aggregating individual ratings or individual predictive results, rather than considering the collective features that govern the choices users make within a group. As a result, such methods are heavily sensitive to data and often fail to learn group preferences when the data are even slightly inconsistent with the predefined aggregation assumptions. To this end, we devise a novel GRS approach that accommodates both individual choices and group decisions in a joint model. More specifically, we propose a deep-architecture model built with collective deep belief networks and dual-wing restricted Boltzmann machines. With such a deep model, we can use high-level features, induced from lower-level features, to represent group preferences and so relieve this vulnerability to the data. Finally, experiments conducted on a real-world dataset prove the superiority of our deep model over other state-of-the-art methods.
Hu, L, Cao, W, Cao, J, Xu, G, Cao, L & Gu, Z 2014, 'Bayesian Heteroskedastic Choice Modeling on Non-identically Distributed Linkages', Proceedings of the 2014 IEEE International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE, Shenzhen, China, pp. 851-856.
Choice modeling (CM) aims to describe and predict choices according to the attributes of subjects and options. If we regard each choice as the formation of a link between a subject and an option, CM can immediately be bridged to the link analysis and prediction (LAP) problem. However, such a mapping is often not trivial or straightforward. In LAP problems, the only available observations are links among objects, while their attributes are often inaccessible. Therefore, we extend CM into a latent feature space to avoid the need for explicit attributes. Moreover, LAP is usually based on a binary linkage assumption that models observed links as positive instances and unobserved links as negative instances. Instead, we use a weaker assumption that treats unobserved links as pseudo-negative instances. Furthermore, most subjects or options may be quite heterogeneous due to long-tail distributions, which conventional LAP approaches fail to capture. To address the above challenges, we propose a Bayesian heteroskedastic choice model to represent the non-identically distributed linkages in LAP problems. Finally, empirical evaluation on real-world datasets proves the superiority of our approach.
Li, F, Xu, G & Cao, L 2014, 'Coupled Item-Based Matrix Factorization', Proceedings, Part I of the Web Information Systems Engineering - WISE 2014 - 15th International Conference, International Conference on Web Information Systems Engineering, Springer, Thessaloniki, Greece, pp. 1-14.
The essence of the cold start and sparsity challenges in Recommender Systems (RS) is that extant techniques, such as Collaborative Filtering (CF) and Matrix Factorization (MF), mainly rely on the user-item rating matrix, which is sometimes not informative enough to predict recommendations. To address these challenges, objective item attributes are incorporated as complementary information. However, most existing methods for inferring relationships between items assume that the attributes are "independently and identically distributed (iid)", which does not always hold in reality. In fact, attributes are more or less coupled with each other via implicit relationships. Therefore, in this paper we propose an attribute-based coupled similarity measure to capture the implicit relationships between items. We then integrate this implicit item coupling into MF to form the Coupled Item-based Matrix Factorization (CIMF) model. Experimental results on two open datasets demonstrate that CIMF outperforms the benchmark methods.
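A minimal, illustrative sketch of the general idea of coupling item relations into MF (the toy ratings, the hand-set coupling matrix S, the graph-style pull term and all hyperparameters are assumptions for this sketch, not the CIMF model itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix (0 = unobserved) and an item-item coupling matrix S.
# In CIMF, S would come from the attribute-based coupled similarity;
# here it is hand-set purely for illustration.
R = np.array([
    [5, 4, 0],
    [4, 0, 1],
    [0, 1, 5],
], dtype=float)
S = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
])

k, lr, reg, beta = 2, 0.01, 0.05, 0.1
U = 0.1 * rng.standard_normal((3, k))  # user factors
V = 0.1 * rng.standard_normal((3, k))  # item factors

for _ in range(2000):
    for i, j in zip(*R.nonzero()):
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - reg * U[i])
        # Coupling term pulls each item's factors toward those of its
        # strongly coupled items (a simple graph-regularization variant).
        pull = sum(S[j, m] * (V[j] - V[m]) for m in range(3))
        V[j] += lr * (err * U[i] - reg * V[j] - beta * pull)

rmse = np.sqrt(np.mean([(R[i, j] - U[i] @ V[j]) ** 2
                        for i, j in zip(*R.nonzero())]))
```

The design choice shown is that item couplings enter the objective as a regularizer rather than as extra observations, so sparse ratings can still be factorized.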
Li, F, Xu, G, Cao, L, Fan, X & Niu, Z 2014, 'CMF: Coupled Matrix Factorization for Recommender Systems', WISE 2014: Web Information Systems Engineering – WISE 2014, International Conference on Web Information Systems Engineering.
Recommender systems research has experienced different stages, from user preference understanding to content analysis. Typical recommendation algorithms were built on the following bases: (1) assuming users and items are IID, namely independent and identically distributed, and (2) focusing on specific aspects such as user preferences or content. In reality, complex recommendation tasks involve and require (1) personalized outcomes tailored to heterogeneous subjective preferences, and (2) explicit and implicit objective coupling relationships between users, items, and ratings, considered as intrinsic forces driving preferences. This inevitably involves non-IID complexity and the need to combine subjective preferences with the objective couplings hidden in recommendation applications. In this paper, we propose a novel generic coupled matrix factorization (CMF) model by incorporating non-IID coupling relations between users and items. Such couplings integrate the intra-coupled interactions within an attribute and the inter-coupled interactions among different attributes. Experimental results on two open datasets demonstrate that user/item couplings can be effectively applied in RS and that CMF outperforms the benchmark methods.
Cao, L 2012, 'Agents and Data Mining Interaction - 8th International Workshop, ADMI 2012, Revised Selected Papers', Lecture Notes in Computer Science 7607, Springer, Valencia, Spain, 2013.
Cao, W, Cao, L & Song, Y 2013, 'Coupled market behavior based financial crisis detection', The 2013 International Joint Conference on Neural Networks, IJCNN 2013, Dallas, TX, USA, August 4-9, 2013, IEEE International Joint Conference on Neural Networks, IEEE, Dallas, TX, USA, pp. 1-8.
Financial crisis detection is a long-standing and challenging issue with significant practical value and impact on the economy, society and globalization. The challenge lies in many aspects, in particular the nonlinear and dynamic characteristics associated with financial crises. Most existing methods rely on individual indicators selected from a single market, and a linear assumption often underlies the prediction models. In practice, a linear assumption may be too strong to be applicable to real market dynamics. More importantly, instruments in different markets, such as the gold price and the petrol price, are often coupled, and a financial crisis may significantly change the couplings between different market indicators. In addition, such couplings in cross-market interactions are likely nonlinear. In this paper, we present a new approach for financial crisis detection, called coupled market behavior analysis, which caters for the often nonlinear couplings between major indicators selected from different markets to detect the different coupled market behaviors of crisis and non-crisis periods. A Coupled Hidden Markov Model (CHMM) is built to characterize the coupled behaviors of the equity, commodity and interest markets as case studies. The empirical results show the need to cater for nonlinear couplings between various markets, and that the proposed approach is much more effective in capturing the coupling and nonlinear relations associated with financial crises than traditionally used approaches such as Signal, Logistic and ANN models.
Cao, W, Wang, C & Cao, L 2012, 'Trading Strategy Based Portfolio Selection for Actionable Trading Agents', Agents and Data Mining Interaction - 8th International Workshop, ADMI 2012, International Workshop on Agents and Data Mining Interaction, Springer, Valencia, Spain, pp. 191-202.
Trading agents are very useful for supporting investors in making decisions in financial markets, but existing trading agent research focuses on simulation with artificial data, which limits its usefulness. A key issue for investors is how trading agents can help them manage their assets according to their risk appetite and thus obtain higher returns. Portfolio optimization is an approach used by many researchers to resolve this issue, but the focus is mainly on developing more accurate mathematical estimation methods, and an important factor is overlooked: the trading strategy. Since the global financial crisis added uncertainty to financial markets, there is an increasing demand for trading agents to be more active in providing trading strategies that better capture trading opportunities. In this paper, we propose a new approach, trading strategy based portfolio selection, in which trading agents combine assets and their corresponding trading strategies to construct new portfolios; trading agents can then help investors obtain the optimal weights for their portfolios according to their risk appetite. We test our approach on historical data; the results show that it can help investors make more profit according to their risk tolerance by selecting the best portfolio in real financial markets.
Cheng, X, Miao, D, Wang, C & Cao, L 2013, 'Coupled term-term relation analysis for document clustering', The 2013 International Joint Conference on Neural Networks, IJCNN 2013, Dallas, TX, USA, August 4-9, 2013, IEEE International Joint Conference on Neural Networks, IEEE, Dallas, TX, USA, pp. 1-8.
Traditional document clustering approaches are usually based on the Bag of Words model, which is limited by its assumption of independence among terms. Recent strategies have been proposed to capture the relation between terms based on statistical analysis, but they estimate the relation between terms purely by their co-occurrence across documents. The implicit interactions with other link terms are overlooked, which leads to the discovery of incomplete information. This paper proposes a coupled term-term relation model for document representation, which considers both the intra-relation (i.e. co-occurrence of terms) and the inter-relation (i.e. dependency of terms via link terms) between a pair of terms. The coupled relation for each pair of terms is further used to map a document onto a new feature space which includes more semantic information. Substantial experiments verify that document clustering incorporating our proposed relation achieves a significant performance improvement compared to state-of-the-art techniques.
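The intra/inter distinction can be illustrated with a toy computation (Jaccard co-occurrence for the intra-relation and a min-max path through link terms for the inter-relation are simplifications assumed here, not the paper's exact definitions):

```python
import numpy as np

# Document-term count matrix for a toy corpus.
X = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

# Intra-relation: direct co-occurrence of two terms across documents
# (here Jaccard over the sets of documents each term appears in).
def intra(X):
    B = (X > 0).astype(float)
    inter = B.T @ B
    counts = B.sum(axis=0)
    union = counts[:, None] + counts[None, :] - inter
    return inter / union

# Inter-relation: dependency of two terms via a shared link term --
# terms 0 and 2 never co-occur directly but are linked through term 1.
def inter_via_links(IaR):
    n = IaR.shape[0]
    IeR = np.zeros_like(IaR)
    for i in range(n):
        for j in range(n):
            links = [min(IaR[i, k], IaR[k, j])
                     for k in range(n) if k not in (i, j)]
            IeR[i, j] = max(links) if links else 0.0
    return IeR

IaR = intra(X)
alpha = 0.5  # weight trading off direct vs link-mediated relations (assumed)
R_coupled = alpha * IaR + (1 - alpha) * inter_via_links(IaR)
```

The point of the sketch: terms 0 and 2 have zero intra-relation (no direct co-occurrence) yet a positive coupled relation, because they share link term 1.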
Fariha, A, Ahmed, CF, Leung, CK, Abdullah, SM & Cao, L 2013, 'Mining Frequent Patterns from Human Interactions in Meetings Using Directed Acyclic Graphs', Lecture Notes in Computer Science, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Gold Coast, Australia, pp. 38-49.
In modern life, interactions between human beings frequently occur in meetings, where topics are discussed. Semantic knowledge of meetings can be revealed by discovering interaction patterns from these meetings. An existing method mines interaction patterns from meetings using tree structures. However, such a tree-based method may not capture all kinds of triggering relations between interactions, and it may not distinguish a participant of a certain rank from another participant of a different rank in a meeting. Hence, the tree-based method may not be able to find all interaction patterns such as those about correlated interaction. In this paper, we propose to mine interaction patterns from meetings using an alternative data structure, namely a directed acyclic graph (DAG). Specifically, a DAG captures both temporal and triggering relations between interactions in meetings. Moreover, to distinguish one participant of a certain rank from another, we assign weights to nodes in the DAG. As such, a meeting can be modeled as a weighted DAG, from which weighted frequent interaction patterns can be discovered. Experimental results showed the effectiveness of our proposed DAG-based method for mining interaction patterns from meetings.
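As a sketch of the data structure only (the meeting, the ranks and the node weights below are invented for illustration, not taken from the paper):

```python
# A meeting modeled as a weighted DAG: nodes are interactions (e.g. a
# proposal, a comment), edges are triggering relations, and node weights
# reflect the rank of the participant who made the interaction.
rank_weight = {"chair": 3.0, "member": 1.0}

nodes = {
    "propose": rank_weight["chair"],
    "comment_a": rank_weight["member"],
    "comment_b": rank_weight["member"],
    "decide": rank_weight["chair"],
}
# Edges capture both temporal order and triggering: the proposal triggers
# two comments, and both comments feed into the decision.
edges = [("propose", "comment_a"), ("propose", "comment_b"),
         ("comment_a", "decide"), ("comment_b", "decide")]

# Topological order via Kahn's algorithm -- a DAG must admit one.
def topo_sort(nodes, edges):
    indeg = {n: 0 for n in nodes}
    for _, v in edges:
        indeg[v] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    order = []
    while queue:
        u = queue.pop(0)
        order.append(u)
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return order

order = topo_sort(nodes, edges)
# Weight of a candidate interaction pattern = sum of its node weights.
pattern_weight = sum(nodes[n] for n in ["propose", "comment_a", "decide"])
```

Unlike a tree, the DAG lets both comments trigger the same decision node, which is exactly the kind of correlated triggering the paper argues trees miss.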
Li, J, Wang, C, Cao, L & Yu, P 2013, 'Efficient Selection of Globally Optimal Rules on Large Imbalanced Data Based on Rule Coverage Relationship Analysis', Proceedings of the 13th SIAM International Conference on Data Mining, SIAM International Conference on Data Mining, SIAM, Austin, Texas, USA, pp. 216-224.
Rule-based anomaly and fraud detection systems often suffer from massive false alerts against a huge number of enterprise transactions. A crucial and challenging problem is to effectively select a globally optimal rule set which can capture very rare anomalies dispersed in large-scale background transactions. Existing rule selection methods, which suffer significantly from complex rule interaction and overlap in large imbalanced data, often lead to a very high false positive rate. In this paper, we analyze the interactions and relationships between rules and their coverage of transactions, and propose a novel metric, Max Coverage Gain. Max Coverage Gain selects the optimal rule set by evaluating the contribution of each rule to overall performance, cutting out locally significant but globally redundant rules without any negative impact on recall. An effective algorithm, MCGminer, is then designed with a series of built-in mechanisms and pruning strategies to handle complex rule interactions and reduce computational complexity towards identifying the globally optimal rule set. Substantial experiments on 13 UCI data sets and a real-time online banking transaction database demonstrate that MCGminer achieves significant improvements in accuracy, scalability, stability and efficiency on large imbalanced data compared to several state-of-the-art rule selection techniques.
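A greedy toy sketch conveys the flavor of coverage-based rule selection (the rules, the gain ratio and the stopping rule are simplifications assumed here; they are not the paper's Max Coverage Gain definition or the MCGminer algorithm):

```python
# Toy rules: each covers some anomalous transactions (to keep) and some
# normal transactions (false alerts). Names and scoring are illustrative.
rules = {
    "r1": {"anomalies": {1, 2}, "normals": {10, 11}},
    "r2": {"anomalies": {2, 3}, "normals": {12}},
    "r3": {"anomalies": {1, 2, 3}, "normals": {10, 11, 12, 13, 14}},
    "r4": {"anomalies": {4}, "normals": set()},
}
all_anomalies = {1, 2, 3, 4}

def greedy_select(rules, all_anomalies):
    selected, covered, false_alerts = [], set(), set()
    while covered != all_anomalies:
        # Marginal gain: newly covered anomalies per newly raised false alert.
        def gain(name):
            r = rules[name]
            new_a = len(r["anomalies"] - covered)
            new_f = len(r["normals"] - false_alerts)
            return new_a / (1 + new_f)
        best = max((n for n in rules if n not in selected), key=gain)
        if gain(best) == 0:
            break  # no remaining rule adds recall; stop
        selected.append(best)
        covered |= rules[best]["anomalies"]
        false_alerts |= rules[best]["normals"]
    return selected, covered, false_alerts

selected, covered, false_alerts = greedy_select(rules, all_anomalies)
```

Note how rule r3 covers many anomalies yet is never selected: everything it catches is already covered by cheaper rules, making it locally significant but globally redundant.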
Li, W, Cao, L, Zhao, D, Cui, X & Yang, J 2013, 'CRNN: Integrating classification rules into neural network', The 2013 International Joint Conference on Neural Networks, IJCNN 2013, Dallas, TX, USA, August 4-9, 2013, IEEE International Joint Conference on Neural Networks, IEEE, Dallas, TX, USA, pp. 1-8.
Associative classification is an important type of rule-based classification, and a variety of approaches have been proposed to build classifiers based on classification rules. In the prediction stage, most existing associative classifiers use an ensemble quality measure over a subset of rules to predict the class label of new data. This method suffers from two problems. First, the classification rules are used individually, so the coupling relations between rules are ignored in prediction; in real-world rule sets, however, rules are often inter-related, and a new data object may partially satisfy many rules. Second, classification-rule-based prediction models lack a general expression of the decision methodology. This paper proposes a classification method that integrates classification rules into a neural network (CRNN for short), which expresses the rule-based decision methodology in the general form of a rule-based network. In comparison with extant rule-based classifiers such as C4.5, CBA, CMAR and CPAR, our approach has two advantages. First, CRNN takes the coupling relations between rules in the training data into account in the prediction step. Second, CRNN achieves better structure and parameter learning performance than a traditional neural network, using a linear computing algorithm instead of a costly iterative learning algorithm. Two ways of generating the classification rule set are used to evaluate CRNN, and CRNN achieves satisfactory performance.
Li, W, Zhao, D, Cao, L & Yang, J 2013, 'An Approach of Hierarchical Concept Clustering on Medical Short Text Corpus', 2013 6th International Conference on Biomedical Engineering and Informatics (BMEI 2013), International Conference on Biomedical Engineering and Informatics (BMEI), IEEE, Hangzhou, China, pp. 509-518.
Hierarchical clustering and conceptual clustering are two important types of clustering analysis, and a variety of approaches have been proposed in previous work. However, few methods are designed to run on medical short-text databases and construct a hierarchical concept taxonomy. This paper proposes a new clustering method, Hierarchical Concept Clustering on a Medical Short Text corpus (HCCST), which presents a new solution for constructing an actionable disease taxonomy from actual medical data. Our approach has three advantages. First, HCCST uses a new similarity measure tailored to the problems of computing distances between medical short texts. Second, an adaptive clustering method is proposed for synonymous disease names that does not require predefining the number of clusters. Third, a mutual-information-based method for recognizing potential hierarchical concept pairs improves on the subsumption method to create a hierarchical disease taxonomy. The evaluation is conducted on a Chinese medical disease-name text dataset, and the results show that HCCST achieves satisfactory performance.
Liu, B, Xiao, Y, Yu, P, Cao, L & Hao, Z 2013, 'Robust Textual Data Streams Mining Based on Continuous Transfer Learning', Proceedings of the 13th SIAM International Conference on Data Mining, SIAM International Conference on Data Mining, SIAM, Austin, Texas, USA, pp. 731-739.
In textual data stream environments, concept drift can occur at any time; existing approaches that partition streams into chunks can have problems if a chunk boundary does not coincide with the change point, which is impossible to predict. Since concept drift can occur at any point in a stream, it will certainly occur within chunks, which is called random concept drift. This paper proposes a chunk-level-based concept drift method (CLCD) that overcomes this chunking problem by continuously monitoring chunk characteristics to revise the classifier, based on transfer learning, in positive and unlabeled (PU) textual data stream environments. Our proposed approach works in three steps. First, we propose core-vocabulary-based criteria to justify and identify random concept drift. Second, we put forward soft-LELC, an extension of LELC (PU learning by extracting likely positive and negative micro-clusters), to extract representative examples from unlabeled data and assign a confidence score to each extracted example; the score represents the degree to which an example belongs to its corresponding class. Third, we set up a transfer-learning-based SVM to build an accurate classifier for the chunks where concept drift is identified in the first step. Extensive experiments show that CLCD can capture random concept drift and outperforms state-of-the-art methods in positive and unlabeled textual data stream environments.
Song, Y, Cao, L, Yin, J & Wang, C 2013, 'Extracting discriminative features for identifying abnormal sequences in one-class mode', The 2013 International Joint Conference on Neural Networks, IJCNN 2013, Dallas, TX, USA, August 4-9, 2013, IEEE International Joint Conference on Neural Networks, IEEE, Dallas, TX, USA, pp. 1-8.
This paper presents a novel framework for detecting abnormal sequences in a one-class setting (i.e., only normal data are available), which is applicable to various domains; examples include intrusion detection, fault detection and speaker verification. Detecting abnormal sequences with only normal data presents several challenges for anomaly detection, including the weak discrimination between normal and abnormal sequences and the unavailability of abnormal data. Traditional model-based anomaly detection techniques can solve some of these issues, but with limited discriminative power because they directly model the normal data. To enhance the discriminative power for anomaly detection, we extract discriminative features from the generative model, based on a principle deduced from the corresponding theoretical analysis, and develop a new anomaly detection framework on top of that. The proposed approach first projects all the sequential data into a model-based, equal-length feature space (theoretically proven to have better discriminative power than the model itself), and then adopts a classifier learned from the transformed data to detect anomalies. Experimental evaluation on both synthetic and real-world data shows that our proposed approach outperforms several anomaly detection baseline algorithms for sequential data.
Wang, C, She, Z & Cao, L 2013, 'Coupled Attribute Analysis on Numerical Data', Proceedings of the 23rd International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, IJCAI/AAAI, Beijing, China, pp. 1736-1742.
The usual representation of quantitative data formalizes it as an information table, which assumes the independence of attributes. In real-world data, attributes more or less interact and are coupled via explicit or implicit relationships. Limited research has been conducted on analyzing such attribute interactions, and it only describes a local picture of attribute couplings in an implicit way. This paper proposes a framework of coupled attribute analysis to capture the global dependency of continuous attributes. Such global couplings integrate the intra-coupled interaction within an attribute (i.e. the correlations between attributes and their own powers) and the inter-coupled interaction among different attributes (i.e. the correlations between attributes and the powers of others) to form a coupled representation for numerical objects via a Taylor-like expansion. This work makes one step forward towards explicitly addressing the global interactions of continuous attributes, verified by applications in data structure analysis, data clustering, and data classification. Substantial experiments on 13 UCI data sets demonstrate that the coupled representation can effectively capture the global couplings of attributes and outperforms the traditional representation, supported by statistical analysis.
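A rough sketch of the representation idea (the data, the degree-3 expansion and the use of a plain Pearson correlation matrix are assumptions made here, not the paper's exact Taylor-like construction):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * rng.standard_normal(50)  # coupled attributes

# Taylor-like expansion: append powers of each attribute up to degree 3.
def expand(X, degree=3):
    return np.hstack([X ** d for d in range(1, degree + 1)])

E = expand(X)

# Pearson correlations between all expanded columns capture both the
# intra-couplings (an attribute vs. its own powers) and inter-couplings
# (an attribute vs. the powers of other attributes).
C = np.corrcoef(E, rowvar=False)

# Coupled representation of each object: its expanded vector weighted by
# the global correlation structure.
Z = E @ C
```

The key point is that each object is no longer described by its raw attribute values alone, but by values filtered through the global correlation structure of attributes and their powers.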
Wang, C, She, Z & Cao, L 2013, 'Coupled clustering ensemble: Incorporating coupling relationships both between base clusterings and objects', Proceedings of the 29th IEEE International Conference on Data Engineering, IEEE International Conference on Data Engineering, IEEE, Brisbane, Australia, pp. 374-385.
Clustering ensemble is a powerful approach for improving the accuracy and stability of individual (base) clustering algorithms. Most existing clustering ensemble methods obtain their final solutions by assuming that base clusterings perform independently of one another and that all objects are independent too. However, in real-world data sources, objects are more or less associated in terms of certain coupling relationships, and base clusterings trained on the source data are complementary to one another, since each of them may capture only a specific aspect rather than the full picture of the data. In this paper, we discuss the problem of explicating the dependency between base clusterings and between objects in clustering ensembles, and propose a framework for coupled clustering ensembles (CCE). CCE not only considers but also integrates the coupling relationships between base clusterings and between objects. Specifically, we involve both the intra-coupling within one base clustering (i.e., cluster label frequency distribution) and the inter-coupling between different base clusterings (i.e., cluster label co-occurrence dependency). Furthermore, we engage both the intra-coupling between two objects in terms of the base clustering aggregation and the inter-coupling among other objects in terms of neighborhood relationships.
Wang, X, Chen, J, Cao, L & Meng, X 2013, 'The Foundation of Fuzzy Rule Interchange in the Semantic Web', Workshop Proceedings of 2013 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, IEEE, Atlanta, Georgia, USA, pp. 280-281.
RIF (Rule Interchange Format) is a W3C recommendation and an appropriate intermediary language for crisp (i.e., non-fuzzy) rule interchange in the Semantic Web, but it is incapable of representing and interchanging fuzzy rules. Therefore, combining RIF and fuzzy sets, we propose f-RIF (fuzzy RIF), investigate its abstract syntax, concrete syntax and UML profile, and define its semantics, which lays a solid foundation for fuzzy rule interchange among heterogeneous fuzzy rule languages.
Wei, W, Li, J, Cao, L, Sun, J, Liu, C & Li, M 2013, 'Optimal Allocation of High Dimensional Assets through Canonical Vines', Advances in Knowledge Discovery and Data Mining: 17th Pacific-Asia Conference, PAKDD 2013, Gold Coast, Australia, April 14-17, 2013, Proceedings, Part I, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Gold Coast, Australia, pp. 366-377.
Keywords: Canonical Vine, Mean Variance Criterion, Financial Return.
Yin, J, Zheng, Z, Cao, L, Song, Y & Wei, W 2013, 'Efficiently Mining Top-K High Utility Sequential Patterns', 2013 IEEE 13th International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE, Dallas, TX, USA, pp. 1259-1264.
High utility sequential pattern mining is an emerging topic in the data mining community. Compared to classic frequent sequence mining, the utility framework provides more informative and actionable knowledge, since the utility of a sequence indicates business value and impact. However, the introduction of "utility" makes the problem fundamentally different from the frequency-based pattern mining framework and brings dramatic challenges. Although existing high utility sequential pattern mining algorithms can discover all patterns satisfying a given minimum utility, it is often difficult for users to set a proper minimum utility: too small a value may produce thousands of patterns, whereas too big a value may lead to no findings. In this paper, we propose a novel framework, top-k high utility sequential pattern mining, to tackle this critical problem. Accordingly, an efficient algorithm, Top-k high Utility Sequence (TUS for short) mining, is designed to identify the top-k high utility sequential patterns without a minimum utility threshold. In addition, three effective features are introduced to handle the efficiency problem, including two strategies for raising the threshold and one pruning step for filtering unpromising items. Our experiments are conducted on both synthetic and real datasets. The results show that TUS, incorporating the efficiency-enhancing strategies, demonstrates impressive performance without missing any high utility sequential patterns.
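To illustrate the threshold-raising idea at item level only (a drastic simplification: TUS itself mines sequential patterns; the transactions and unit profits below are invented for this sketch):

```python
import heapq

# Toy transactions: (item, quantity) pairs, with external unit profits.
profit = {"a": 5, "b": 2, "c": 1, "d": 4}
transactions = [
    [("a", 1), ("b", 2)],
    [("a", 2), ("c", 3)],
    [("b", 1), ("d", 2)],
    [("c", 4), ("d", 1)],
]

def top_k_utility(transactions, profit, k):
    # Aggregate each item's total utility across transactions.
    utility = {}
    for t in transactions:
        for item, qty in t:
            utility[item] = utility.get(item, 0) + qty * profit[item]
    # Min-heap of size k: its root is the running utility threshold; any
    # candidate below it can be pruned (the essence of threshold raising).
    heap, threshold = [], 0
    for item, u in utility.items():
        if len(heap) < k:
            heapq.heappush(heap, (u, item))
        elif u > heap[0][0]:
            heapq.heapreplace(heap, (u, item))
        threshold = heap[0][0] if len(heap) == k else 0
    return sorted(heap, reverse=True), threshold

top, threshold = top_k_utility(transactions, profit, 2)
```

As candidates are processed, the heap root rises (here from 6 to 7 to 12), so later low-utility candidates can be discarded without a user-supplied minimum utility.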
Yu, PS, Cao, L, Ras, Z, Wong, L, Jiang, F & Li, J 2013, 'Preface to the 2013 international workshop on domain driven data mining', Proceedings - IEEE 13th International Conference on Data Mining Workshops, ICDMW 2013.
The name of the 5th author has been printed incorrectly in the paper. Instead of "Xixi Chen" it should be "Qianqian Chen".
Yu, Y, Wang, C, Gao, Y, Cao, L & Chen, X 2013, 'A Coupled Clustering Approach for Items Recommendation', Lecture Notes in Computer Science, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Gold Coast, Australia, pp. 365-376.
Recommender systems are very useful given the huge volume of information available on the Web. They help users alleviate the information overload problem by recommending personalized information, products or services (called items). Collaborative filtering and content-based recommendation algorithms have been widely deployed in e-commerce web sites, but both suffer from the scalability problem. In addition, there are few suitable similarity measures for content-based recommendation methods to compute the similarity between items. In this paper, we propose a hybrid recommendation algorithm that combines the content-based and collaborative filtering techniques and incorporates coupled similarity. Our method first partitions items into several item groups using a coupled version of the k-modes clustering algorithm, where the similarity between items is measured by the Coupled Object Similarity, which considers couplings between items. The collaborative filtering technique is then used to produce recommendations for active users. Experimental results show that our proposed hybrid recommendation algorithm effectively solves the scalability issue of recommender systems and provides comparable recommendation quality even when most item features are lacking.
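A toy sketch of the k-modes step (with plain matching dissimilarity standing in for the paper's Coupled Object Similarity; the items and the fixed seeds are invented for illustration):

```python
from collections import Counter

# Toy item attribute table: (genre, price band), both categorical.
items = [
    ("action", "high"), ("action", "high"), ("action", "mid"),
    ("drama", "low"), ("drama", "low"), ("drama", "mid"),
]

# Simple matching dissimilarity; the paper instead uses the Coupled
# Object Similarity, which also accounts for attribute value couplings.
def dissim(a, b):
    return sum(x != y for x, y in zip(a, b))

def k_modes(items, modes, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in modes]
        for it in items:
            j = min(range(len(modes)), key=lambda c: dissim(it, modes[c]))
            clusters[j].append(it)
        # New mode = most frequent value per attribute within each cluster.
        modes = [
            tuple(Counter(v[d] for v in cl).most_common(1)[0][0]
                  for d in range(len(items[0])))
            if cl else modes[c]
            for c, cl in enumerate(clusters)
        ]
    return modes, clusters

# Fixed seed modes for the sketch (one per expected group).
modes, clusters = k_modes(items, [items[0], items[3]])
```

In the hybrid scheme, collaborative filtering would then run within each item group rather than over the full catalogue, which is where the scalability gain comes from.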
Fu, B, Xu, G, Wang, Z & Cao, L 2013, 'Leveraging Supervised Label Dependency Propagation for Multi-label Learning', 2013 IEEE 13th International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE, Dallas, TX, USA, pp. 1061-1066.
Exploiting label dependency is a key challenge in multi-label learning, and current methods address this problem mainly by training models on the combination of related labels and original features. However, label dependency cannot be exploited dynamically and mutually in this way. Therefore, we propose a novel paradigm that leverages label dependency in an iterative way. Specifically, each label's prediction is updated and propagated to other labels via a random walk with restart process. Meanwhile, the label propagation is implemented as a supervised learning procedure by optimizing a loss function, so that more appropriate label dependencies can be learned. Extensive experiments are conducted, and the results demonstrate that our method achieves considerable improvements in terms of several evaluation metrics.
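The random-walk-with-restart propagation step can be sketched as follows (the 3-label dependency matrix and the restart probability are invented for illustration; the paper additionally learns the propagation in a supervised way):

```python
import numpy as np

# Label dependency graph over 3 labels: P[i, j] = strength with which
# label j's score propagates to label i (values assumed for the sketch).
P = np.array([
    [0.0, 0.8, 0.2],
    [0.8, 0.0, 0.2],
    [0.2, 0.2, 0.6],
])
P = P / P.sum(axis=0)  # column-normalize so propagation preserves mass

y0 = np.array([1.0, 0.0, 0.0])  # initial per-label predictions for one instance
alpha = 0.3                     # restart probability (assumed)

# Random walk with restart: repeatedly propagate scores along label
# dependencies while pulling back toward the initial prediction.
y = y0.copy()
for _ in range(100):
    y = (1 - alpha) * P @ y + alpha * y0
```

At convergence the strongly coupled second label inherits much of the first label's score, while the weakly coupled third label receives little, which is the dynamic, mutual exploitation of dependency the abstract describes.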
Hu, L, Cao, J, Xu, G, Cao, L, Gu, Z & Zhu, C 2013, 'Personalized recommendation via cross-domain triadic factorization', Proceedings of the 22nd international conference on World Wide Web WWW'13, International World Wide Web Conference, ACM, Rio de Janeiro, Brazil, pp. 595-606.
Collaborative filtering (CF) is a major technique in recommender systems for helping users find their potentially desired items. Since the data sparsity problem is commonly encountered in real-world scenarios, Cross-Domain Collaborative Filtering (CDCF) has become an emerging research topic in recent years. However, due to the lack of sufficiently dense explicit feedback, and even the absence of any feedback in users' uninvolved domains, current CDCF approaches may not perform satisfactorily in user preference prediction. In this paper, we propose a generalized Cross Domain Triadic Factorization (CDTF) model over the triadic relation user-item-domain, which can better capture the interactions between domain-specific user factors and item factors. In particular, we devise two CDTF algorithms to leverage users' explicit and implicit feedback respectively, along with a genetic algorithm based weight tuning method to optimally trade off influence among domains. Finally, we conduct experiments to evaluate our models and compare them with other state-of-the-art models using two real-world datasets. The results show the superiority of our models over the comparative models.
Hu, L, Cao, J, Xu, G, Wang, J, Gu, Z & Cao, L 2013, 'Cross-Domain Collaborative Filtering via Bilinear Multilevel Analysis', Proceedings of the 23rd International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, IJCAI/AAAI, Beijing, China, pp. 2626-2632.
Cross-domain collaborative filtering (CDCF), which aims to leverage data from multiple domains to relieve the data sparsity issue, has become an emerging research topic in recent years. However, current CDCF methods, which mainly consider user and item factors but largely neglect the heterogeneity of domains, may lead to improper knowledge transfer. To address this problem, we propose a novel CDCF model, Bilinear Multilevel Analysis (BLMA), which seamlessly introduces multilevel analysis theory to the most successful collaborative filtering method, matrix factorization (MF). Specifically, we employ BLMA to more effectively address the determinants of ratings from a hierarchical view, jointly considering domain, community, and user effects so as to overcome the issues caused by traditional MF approaches. Moreover, a parallel Gibbs sampler is provided to learn these effects. Finally, experiments conducted on a real-world dataset demonstrate the superiority of BLMA over other state-of-the-art methods.
Li, F, Xu, G, Cao, L, Fan, X & Niu, Z 2013, 'CGMF: Coupled Group-Based Matrix Factorization for Recommender System', Lecture Notes in Computer Science, International Conference on Web Information Systems Engineering, Springer, Nanjing, China, pp. 289-298.
With the advent of social influence, social recommender systems have become an active research topic, making recommendations based on the ratings of users that have close social relations with the given user. The underlying assumption is that a user's taste is similar to that of his/her friends in social networks. In fact, users enjoy different groups of items with different preferences, and a user may be trusted by his/her friends on some specific groups rather than on all groups. Unfortunately, most extant social recommender systems are unable to differentiate users' social influence across different groups, resulting in unsatisfactory recommendations. Moreover, most extant systems rely mainly on social relations but overlook the influence of relations between items. In this paper, we propose an innovative coupled group-based matrix factorization model for recommender systems, leveraging the user and item groups learned by topic modeling and incorporating couplings between users and items as well as within users and items. Experiments conducted on publicly available data sets demonstrate the effectiveness of our approach.
Song, Y, Zhang, J, Cao, L & Sangeux, M 2013, 'On Discovering the Correlated Relationship between Static and Dynamic Data in Clinical Gait Analysis', Lecture Notes in Computer Science, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, Prague, Czech Republic, pp. 563-578.
`Gait' is a person's manner of walking. Patients may have an abnormal gait due to a range of physical impairments or brain damage. Clinical gait analysis (CGA) is a technique for identifying the underlying impairments that affect a patient's gait pattern. The CGA is critical for treatment planning. Essentially, CGA tries to use a patient's physical examination results, known as static data, to interpret the dynamic characteristics of an abnormal gait, known as dynamic data. This process is carried out by gait analysis experts, based mainly on their experience, which may lead to subjective diagnoses. To facilitate the automation of this process and form a relatively objective diagnosis, this paper proposes a new probabilistic correlated static-dynamic model (CSDM) to discover correlated relationships between the dynamic characteristics of gait and their root cause in the static data space. We propose an EM-based algorithm to learn the parameters of the CSDM. One of the main advantages of the CSDM is its ability to provide intuitive knowledge. For example, the CSDM can describe what kinds of static data will lead to what kinds of hidden gait patterns in the form of a decision tree, which helps us to infer dynamic characteristics based on static data. Our initial experiments indicate that the CSDM is promising for discovering the correlated relationship between physical examination (static) and gait (dynamic) data.
Cao, L 2011, 'Agents and Data Mining Interaction', 7th International Workshop, ADMI, Springer, Taipei.
Cao, L 2011, 'New Frontiers in Applied Data Mining', PAKDD 2011 International Workshops, Springer, Shenzhen, China.
Cao, L 2012, 'Non-iidness: Coupled object and pattern analysis', Conferences in Research and Practice in Information Technology Series, p. 5.
© 2012, Australian Computer Society, Inc. Most existing data mining algorithms are based on the IID assumption, which treats objects independently of each other. In the real world, objects are either loosely or tightly coupled with each other. For instance, a moving vehicle on the street interacts with the cars before and after it, and with those on its left and right, if any. In social networks, people interact with each other at different levels for varied purposes. Such interactions, or coupling relationships, are ubiquitous and spread across various levels: between objects, between the attributes describing an object, and between values within an attribute. It is crucial to cater for such relations in object analysis. On the other hand, the usual patterns identified by data mining are based on independent objects or items. For instance, a large number of frequent patterns are often mined by existing algorithms and treated as independent of each other. In fact, due to the object coupling relationships, patterns are associated with each other in structural and/or semantic aspects, yet pattern relationship analysis is often ignored. In this talk, we will explore the needs, challenges and opportunities of analyzing complex object relations and complex pattern relations. On top of a framework for non-IID-based coupled object and pattern analysis, several corresponding techniques will be introduced: coupled object analysis to define and quantify the coupling relationships within and between objects and within and between attributes, and combined pattern mining to identify groups of patterns coupled by certain relationships. Coupled behavior analysis will be explored to analyse the behaviors of a group of actors. We will show how such new frameworks outperform the classic IID-based data mining framework in handling complex data, behaviors, relations, environments and patterns in clustering, frequent pattern mining, and classification. Several real-...
Fan, X & Cao, L 2012, 'A Theoretical Framework of the Graph Shift Algorithm', Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, ACM, Toronto, Ontario, Canada, pp. 2419-2420.
Since no theoretical foundations for proving the convergence of the Graph Shift (GS) algorithm have been reported, we provide a generic framework consisting of three key GS components that fits Zangwill's convergence theorem. We show that the sequence set generated by the GS procedures always terminates at a local maximum or, at worst, contains a subsequence that converges to a local maximum of the similarity measure function. Moreover, a theoretical framework is proposed to extend our proof to a more general case.
Fan, X, Cao, L, Cui, X, Zhu, L & Ong, Y 2012, 'Maximum Margin Clustering on Evolutionary Data', Proceedings of the 21st ACM international conference on Information and knowledge management, ACM International Conference on Information and Knowledge Management, ACM, Maui, Hawaii, USA, pp. 625-634.
Evolutionary data, such as topic-changing blogs and evolving trading behaviors in capital markets, is widely seen in business and social applications. The time factor and the intrinsic change embedded in evolutionary data greatly challenge evolutionary clustering. To incorporate the time factor, existing methods mainly regard the evolutionary clustering problem as a linear combination of snapshot cost and temporal cost, and reflect the time factor through the temporal cost. Despite the promising results obtained, they still face accuracy and scalability challenges. This paper proposes a novel evolutionary clustering approach, evolutionary maximum margin clustering (e-MMC), to cluster large-scale evolutionary data from the maximum margin perspective. e-MMC incorporates two frameworks: Data Integration, from the data-changing perspective, and Model Integration, corresponding to model adjustment, to tackle the time factor and change, with an adaptive label allocation mechanism. Three e-MMC clustering algorithms are proposed based on the two frameworks. Extensive experiments performed on synthetic data, UCI data and real-world blog data confirm that e-MMC outperforms the state-of-the-art clustering algorithms in terms of accuracy, computational cost and scalability. This shows that e-MMC is particularly suitable for clustering large-scale evolving data.
Jiang, Y, Tsai, PC, Hao, Z & Cao, L 2012, 'A novel auto-parameters selection process for image segmentation', IEEE Congress on Evolutionary Computation 2012, IEEE Congress on Evolutionary Computation, IEEE, Brisbane, Australia, pp. 1-7.
Segmentation is a process for obtaining desirable features in image processing. However, existing techniques that use the multilevel thresholding method in image segmentation are computationally demanding due to the lack of an automatic parameter selection process. This paper proposes an automatic parameter selection technique, called the automatic multilevel thresholding algorithm using stratified sampling and Tabu Search (AMTSSTS), to remedy these limitations. It automatically determines the appropriate threshold number and values by (1) dividing an image into even strata (blocks) to extract samples; (2) applying a Tabu Search-based optimization technique on these samples to maximize the ratios of their means and variances; (3) preliminarily determining the threshold number and values based on the optimized samples; and (4) further optimizing these samples using a novel local criterion function that exploits the local continuity property of an image. Experiments on the Berkeley datasets show that AMTSSTS is an efficient and effective technique that provides smoother results than several methods developed in recent years.
Moemeng, C, Wang, C & Cao, L 2012, 'Obtaining an Optimal MAS Configuration for Agent-Enhanced Mining Using Constraint Optimization', Lecture Notes in Computer Science, International Workshop on Agents and Data Mining Interaction, Springer, Taipei, Taiwan, pp. 46-57.
We investigate an interaction mechanism between agents and data mining, focusing on agent-enhanced mining. Existing data mining tools use workflows to capture user requirements. Workflow enactment can be improved with a suitable underlying execution layer, namely a Multi-Agent System (MAS). From this perspective, we propose a strategy for obtaining an optimal MAS configuration from a given workflow when resource access restrictions and communication cost constraints are concerned, which is essentially a constraint optimization problem. In this paper, we show how a workflow is modeled in a way that can be optimized, and how the optimized model is used to obtain an optimal MAS configuration. Finally, we demonstrate that our strategy can improve load balancing and reduce communication cost during workflow enactment.
She, Z, Wang, C & Cao, L 2012, 'CCE: A Coupled Framework of Clustering Ensembles', Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI Press, Toronto, Ontario, Canada, pp. 2455-2456.
Clustering ensembles mainly rely on pairwise similarity to capture the consensus function. However, they usually consider each base clustering independently and treat the similarity measure coarsely as either 0 or 1. To address these two issues, we propose a coupled framework of clustering ensembles, CCE, and exemplify it with CCSPA, the coupled version of CSPA. Experiments demonstrate the superiority of CCSPA over baseline approaches in terms of clustering accuracy.
Song, Y & Cao, L 2012, 'Graph-based coupled behavior analysis: A case study on detecting collaborative manipulations in stock mark', The 2012 International Joint Conference on Neural Networks (IJCNN), IEEE International Joint Conference on Neural Networks, IEEE, Brisbane, Australia, pp. 1-8.
Coupled behaviors, which refer to behaviors having some relationships between them, are commonly seen in many real-world scenarios, especially in stock markets. Recently, coupled hidden Markov model (CHMM)-based coupled behavior analysis has been proposed to consider the coupled relationships in a hidden state space. However, it requires aggregation of the behavioral data to cater for the CHMM modeling, which may overlook the couplings within the aggregated behaviors to some extent. In addition, the Markov assumption limits its capability to capture temporal couplings. Thus, this paper proposes a novel graph-based framework for detecting abnormal coupled behaviors. The proposed framework represents the coupled behaviors in a graph view without aggregating the behavioral data, and is flexible enough to capture richer coupling information about the behaviors (not necessarily temporal relations). On top of that, the couplings are learned via relational learning methods, and an efficient anomaly detection algorithm is proposed as well. Experimental results on a real-world stock market data set show that the proposed framework outperforms the CHMM-based one in both technical and business measures.
Song, Y, Cao, L, Wu, X, Wei, G, Ye, W & Ding, W 2012, 'Coupled behavior analysis for capturing coupling relationships in group-based market manipulations', Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM International Conference on Knowledge Discovery and Data Mining, ACM, Beijing, China, pp. 976-984.
In stock markets, an emerging challenge for surveillance is that a group of hidden manipulators collaborate with each other to manipulate the price movement of securities. Recently, coupled hidden Markov model (CHMM)-based coupled behavior analysis (CBA) has been proposed to consider the coupling relationships in the above group-based behaviors for manipulation detection. From the modeling perspective, however, this requires overall aggregation of the behavioral data to cater for the CHMM modeling, which does not differentiate the coupling relationships presented in different forms within the aggregated behaviors and degrades the capability for further anomaly detection. Thus, this paper suggests a general CBA framework for detecting group-based market manipulation by capturing more comprehensive couplings, and proposes two variant implementations, which are hybrid coupling (HC)-based and hierarchical grouping (HG)-based respectively. The proposed framework consists of three stages. The first stage, qualitative analysis, generates possible qualitative coupling relationships between behaviors with or without domain knowledge. In the second stage, a quantitative representation of the coupled behaviors is learned via appropriate methods. In the third stage, anomaly detection algorithms are proposed to cater for different application scenarios. Experimental results on data from a major Asian stock market show that the proposed framework outperforms the CHMM-based analysis in detecting abnormal collaborative market manipulations. Additionally, the two implementations are compared in terms of their effectiveness for different application scenarios.
Wang, C, Wang, M, She, Z & Cao, L 2012, 'CD: A Coupled Discretization Algorithm', Lecture Notes in Computer Science, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Kuala Lumpur, Malaysia, pp. 407-418.
Discretization techniques play an important role in data mining and machine learning. While numeric data is predominant in the real world, many supervised learning algorithms are restricted to discrete variables. Thus, a variety of research has been conducted on discretization, the process of converting continuous attribute values into a limited set of intervals. Recent work derived from entropy-based discretization methods, which has produced impressive results, introduces information attribute dependency to reduce the uncertainty level of a decision table, but pays no attention to the increase in the certainty degree from the perspective of the positive domain ratio. This paper proposes a discretization algorithm based on both the positive domain and its coupling with information entropy, which not only considers information attribute dependency but also captures deterministic feature relationships. Substantial experiments on extensive UCI data sets provide evidence that our proposed coupled discretization algorithm generally outperforms seven other existing methods, as well as the positive domain-based algorithm proposed in this paper, in terms of simplicity, stability, consistency, and accuracy.
Wei, W, Fan, X, Li, J & Cao, L 2012, 'Model the Complex Dependence Structures of Financial Variables by Using Canonical Vine', The 21st ACM International Conference on Information and Knowledge Management, ACM International Conference on Information and Knowledge Management, Springer, Maui, Hawaii, USA, pp. 1382-1391.
Financial variables such as asset returns in a massive market contain various hierarchical and horizontal relationships, forming complicated dependence structures. Modeling and mining these structures is challenging due to their high structural complexity as well as the stylized facts of market data. This paper introduces a new canonical vine dependence model to identify the asymmetric and non-linear dependence structures of asset returns without any prior independence assumptions. To simplify the model while maintaining its merit, a partial correlation-based method is proposed to optimize the canonical vine. Compared with the original canonical vine, the new model still preserves the most important dependencies, but many unimportant nodes are removed to simplify the canonical vine structure. Our model is applied to construct and analyze the dependence structures of European stocks as case studies. Its performance is evaluated by measuring the Value at Risk of a portfolio, a widely used risk management measure. In comparison with a very recent canonical vine model and the `full' model, our experimental results demonstrate that our model achieves a much better quality of Value at Risk, providing insightful knowledge for investors to control and reduce the aggregated risk of the portfolio.
Yin, J, Zheng, Z & Cao, L 2012, 'USpan: an efficient algorithm for mining high utility sequential patterns', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM International Conference on Knowledge Discovery and Data Mining, ACM, Beijing, China, pp. 660-668.
Sequential pattern mining plays an important role in many applications, such as bioinformatics and consumer behavior analysis. However, the classic frequency-based framework often leads to many patterns being identified, most of which are not informative enough for business decision-making. In frequent pattern mining, a recent effort has been to incorporate utility into the pattern selection framework, so that high utility (frequent or infrequent) patterns are mined that address typical business concerns such as the dollar value associated with each pattern. In this paper, we incorporate utility into sequential pattern mining and define a generic framework for high utility sequence mining. An efficient algorithm, USpan, is presented to mine high utility sequential patterns. In USpan, we introduce the lexicographic quantitative sequence tree to extract the complete set of high utility sequences, and design concatenation mechanisms for calculating the utility of a node and its children, with two effective pruning strategies. Substantial experiments on both synthetic and real datasets show that USpan efficiently identifies high utility sequences from large-scale data with a very low minimum utility.
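As a toy illustration of the utility notion behind this line of work (a hedged sketch, not USpan itself: the profit table, pattern and quantitative sequence below are invented for the example), the utility of a pattern can be taken as the maximum utility over all of its occurrences in a quantitative sequence:

```python
# Hypothetical sketch of the "high utility" idea (not the authors'
# implementation): each sequence element is an (item, quantity) pair,
# each item has a per-unit profit, and the utility of a pattern is the
# maximum total utility among all of its occurrences in the sequence.

PROFIT = {"a": 2, "b": 3, "c": 1}  # assumed per-item profit table

def max_utility(pattern, sequence):
    """Max utility of `pattern` (a list of items) in `sequence`
    (a list of (item, quantity) pairs); 0 if the pattern does not occur."""
    def rec(p_idx, s_idx):
        if p_idx == len(pattern):
            return 0                       # whole pattern matched
        best = None
        for i in range(s_idx, len(sequence)):
            item, qty = sequence[i]
            if item == pattern[p_idx]:
                rest = rec(p_idx + 1, i + 1)
                if rest is not None:       # remainder of pattern matched
                    u = PROFIT[item] * qty + rest
                    best = u if best is None else max(best, u)
        return best
    return rec(0, 0) or 0

seq = [("a", 2), ("b", 1), ("a", 5), ("b", 2)]
print(max_utility(["a", "b"], seq))  # best occurrence: a(5)*2 + b(2)*3 = 16
```

A frequency-based miner would treat every occurrence of `["a", "b"]` equally; the utility view instead ranks patterns by the value they carry, which is the concern the abstract highlights.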
Zhu, L, Cao, L & Yang, J 2012, 'Multiobjective evolutionary algorithm-based soft subspace clustering', IEEE Congress on Evolutionary Computation 2012, IEEE Congress on Evolutionary Computation, IEEE, Brisbane, Australia, pp. 1-8.
In this paper, a multiobjective evolutionary algorithm-based soft subspace clustering approach, MOSSC, is proposed to simultaneously optimize the weighted within-cluster compactness and weighted between-cluster separation incorporated in two different clustering validity criteria. The main advantage of MOSSC lies in the fact that it effectively integrates the merits of soft subspace clustering with the good properties of the multiobjective optimization-based approach to fuzzy clustering. This makes it possible to avoid becoming trapped in local minima and thus to obtain more stable clustering results. Substantial experimental results on both synthetic and real data sets demonstrate that MOSSC is generally effective in subspace clustering and can achieve superior performance over existing state-of-the-art soft subspace clustering algorithms.
Cao, L 2011, 'Advances in Knowledge Discovery and Data Mining, 15th Pacific-Asia Conference Proceedings, Part 1', PAKDD 2011, Springer, China.
Cao, L 2011, 'Advances in Knowledge Discovery and Data Mining, 15th Pacific-Asia Conference Proceedings, Part 2', PAKDD 2011, Springer, China.
Cao, L 2011, 'Proceedings of the 2011 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2011', Proceedings of the 2011 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2011, IEEE Computer Society, Lyon, France.
Liu, B, Xiao, Y, Cao, L & Yu, P 2011, 'One-class-based uncertain data stream learning', Proceedings of the Eleventh SIAM International Conference on Data Mining, SIAM International Conference on Data Mining, SDM, Arizona, pp. 992-1003.
This paper presents a novel approach to one-class-based uncertain data stream learning. Our proposed approach works in three steps. Firstly, we put forward a local kernel-density-based method to generate a bound score for each instance, which refines the location of the corresponding instance. Secondly, we construct an uncertain one-class classifier by incorporating the generated bound score into a one-class SVM-based learning phase. Thirdly, we devise an ensemble classifier, integrating uncertain one-class classifiers built on the current and historical chunks, to cope with the concept drift involved in the uncertain data stream environment. Our proposed method explicitly handles the uncertainty of the input data and enhances the ability of one-class learning to reduce sensitivity to noise. Extensive experiments on uncertain data streams demonstrate that our proposed approach achieves better performance and is highly robust to noise in comparison with state-of-the-art one-class learning methods.
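The chunk-by-chunk ensemble idea in the third step can be sketched roughly as follows (a simplified illustration, not the paper's algorithm: a toy centroid-plus-radius one-class learner stands in for the uncertain one-class SVM, and the majority vote over recent chunks stands in for the paper's ensemble integration):

```python
import numpy as np

# Toy sketch of a one-class ensemble over a data stream: fit one
# simple one-class model per arriving chunk, keep only the most
# recent models (to absorb concept drift), and classify by vote.

class ToyOneClass:
    """Centroid + 90th-percentile radius: a stand-in one-class learner."""
    def fit(self, X):
        self.center = X.mean(axis=0)
        d = np.linalg.norm(X - self.center, axis=1)
        self.radius = np.percentile(d, 90)   # tolerate ~10% outliers
        return self

    def predict(self, X):
        d = np.linalg.norm(X - self.center, axis=1)
        return np.where(d <= self.radius, 1, -1)

class ChunkEnsemble:
    def __init__(self, max_members=3):
        self.members, self.max_members = [], max_members

    def update(self, chunk):
        self.members.append(ToyOneClass().fit(chunk))
        self.members = self.members[-self.max_members:]  # drop old chunks

    def predict(self, X):
        votes = np.mean([m.predict(X) for m in self.members], axis=0)
        return np.where(votes >= 0, 1, -1)   # 1 = normal, -1 = outlier

rng = np.random.default_rng(0)
ens = ChunkEnsemble()
for _ in range(3):                           # three arriving chunks
    ens.update(rng.normal(0.0, 1.0, (200, 2)))
print(ens.predict(np.array([[0.0, 0.0], [8.0, 8.0]])))  # [ 1 -1]
```

Capping the member list is the crude drift-handling device here: once the stream's distribution shifts, models fitted on stale chunks age out of the vote.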
Xiao, Y, Liu, B, Yin, J, Cao, L, Zhang, C & Hao, Z 2011, 'Similarity-Based Approach for Positive and Unlabeled Learning', Proceedings of the 22nd International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, AAAI Press, Barcelona, Catalonia, Spain, pp. 1577-1582.
Positive and unlabeled learning (PU learning) has been investigated to deal with the situation where only positive examples and unlabeled examples are available. Most previous works focus on identifying some negative examples from the unlabeled data so that supervised learning methods can be applied to build a classifier. However, the remaining unlabeled data, which cannot be explicitly identified as positive or negative (we call them ambiguous examples), are either excluded from the training phase or simply forced into one of the two classes. Consequently, performance may be constrained. This paper proposes a novel approach, called the similarity-based PU learning (SPUL) method, which associates the ambiguous examples with two similarity weights indicating the similarity of an ambiguous example towards the positive class and the negative class, respectively. Local similarity-based and global similarity-based mechanisms are proposed to generate the similarity weights. The ambiguous examples and their similarity weights are thereafter incorporated into an SVM-based learning phase to build a more accurate classifier. Extensive experiments on real-world datasets have shown that SPUL outperforms state-of-the-art PU learning methods.
Zhu, L, Cao, L & Yang, J 2011, 'Soft subspace clustering with competitive agglomeration', IEEE International Conference on Fuzzy Systems 2011, IEEE International Conference on Fuzzy Systems, IEEE, Taipei, pp. 691-698.
In this paper, two novel soft subspace clustering algorithms, namely fuzzy weighting subspace clustering with competitive agglomeration (FWSCA) and entropy weighting subspace clustering with competitive agglomeration (EWSCA), are proposed to overcome the problems of the unknown number of clusters and the initialization of prototypes in soft subspace clustering. The main advantage of FWSCA and EWSCA lies in the fact that they effectively integrate the merits of soft subspace clustering with the good properties of fuzzy clustering with competitive agglomeration. This makes it possible to obtain the appropriate number of clusters during the clustering process. Moreover, the FWSCA and EWSCA algorithms converge regardless of the initial number of clusters and the initialization. Substantial experimental results on both synthetic and real data sets demonstrate the effectiveness of FWSCA and EWSCA in addressing the two problems.
Dong, X, Zheng, Z, Cao, L, Zhao, Y, Zhang, C, Li, J, Wei, W & Ou, Y 2011, 'e-NSP: efficient negative sequential pattern mining based on identified positive patterns without database rescanning', Proceedings of the 20th ACM International Conference on Information and Knowledge Management, ACM International Conference on Information and Knowledge Management, ACM, Glasgow, Scotland, UK, pp. 825-830.
Mining Negative Sequential Patterns (NSP) is much more challenging than mining Positive Sequential Patterns (PSP) due to the high computational complexity and huge search space required in calculating Negative Sequential Candidates (NSC). Very few approaches are available for mining NSP, and they mainly rely on re-scanning databases after identifying PSP. As a result, they are very inefficient. In this paper, we propose an efficient algorithm for mining NSP, called e-NSP, which mines NSP by only involving the identified PSP, without re-scanning databases. First, negative containment is defined to determine whether or not a data sequence contains a negative sequence. Second, an efficient approach is proposed to convert the negative containment problem to a positive containment problem. The supports of NSC are then calculated based only on the corresponding PSP. Finally, a simple but efficient approach is proposed to generate NSC. With e-NSP, mining NSP does not require additional database scans, and existing PSP mining algorithms can be integrated into e-NSP to mine NSP efficiently. e-NSP is compared with two currently available NSP mining algorithms on 14 synthetic and real-life datasets. Intensive experiments show that e-NSP takes as little as 3% of the runtime of the baseline approaches and is applicable for efficient mining of NSP in large datasets.
Wang, C, Cao, L, Li, J, Wei, W, Ou, Y & Wang, M 2011, 'Coupled Nominal Similarity in Unsupervised Learning', Proceedings of the 20th ACM international conference on Information and knowledge management, ACM International Conference on Information and Knowledge Management, ACM, Glasgow, UK, pp. 973-978.
The similarity between nominal objects is not straightforward, especially in unsupervised learning. This paper proposes coupled similarity metrics for nominal objects, which consider not only the intra-coupled similarity within an attribute (i.e., value frequency distribution) but also the inter-coupled similarity between attributes (i.e., feature dependency aggregation). Four metrics are designed to calculate the inter-coupled similarity between two categorical values by considering their relationships with other attributes. Theoretical analysis reveals the equivalent accuracy and superior efficiency of the intersection-based metric compared with the others, in particular for large-scale data. Substantial experiments on extensive UCI data sets verify the theoretical conclusions. In addition, clustering experiments based on the derived dissimilarity metrics show a significant performance improvement.
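A minimal sketch of a frequency-based intra-coupled similarity for two values of one categorical attribute (an illustration of the general idea, not necessarily the paper's exact formula; the column data is invented):

```python
from collections import Counter

# Illustrative sketch: intra-coupled similarity between two categorical
# values of one attribute, driven by their occurrence frequencies in
# that attribute's column. Pairs of frequent values score higher; the
# denominator keeps the result in (0, 1).

def intra_coupled_sim(column, x, y):
    freq = Counter(column)
    fx, fy = freq[x], freq[y]
    return (fx * fy) / (fx + fy + fx * fy)

col = ["red", "red", "red", "blue", "blue", "green"]
print(round(intra_coupled_sim(col, "red", "blue"), 3))   # 6/11 -> 0.545
print(round(intra_coupled_sim(col, "red", "green"), 3))  # 3/7  -> 0.429
```

The inter-coupled half of the metric, which conditions on the other attributes, is where the four metrics in the abstract differ; this sketch covers only the frequency-distribution half.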
Feng, J, Wang, M, Wang, C & Cao, L 2010, 'Enhanced co-occurrence distances for categorical data in unsupervised learning', 2010 International Conference on Machine Learning and Cybernetics, ICMLC 2010, International Conference on Machine Learning and Cybernetics, IEEE, Qingdao, pp. 2071-2078.
Distance metrics for categorical data play an important role in unsupervised learning such as clustering. They also dramatically affect learning accuracy and computational complexities. Recently, two co-occurrence methods, Co-occurrence Distance based on
Liu, B, Xiao, Y, Cao, L & Yu, P 2010, 'Orientation Distance-based Discriminative Feature Extraction for Multi-Class Classification', Proceedings of the 19th ACM Conference on Information and Knowledge Management & Co-Located Workshops (CIKM 2010), ACM Conference on Information and Knowledge Manage, ACM, Toronto, Ontario, Canada, pp. 909-918.
Feature extraction is an effective step in data mining and machine learning. While many feature extraction methods have been proposed for clustering, classification and regression, very limited work has been done on multi-class classification problems. In fact, the accuracy of multi-class classification relies on well-extracted features, the modeling part aside. This paper proposes a new feature extraction method, namely extracting orientation distance-based discriminative (ODD) features, which is particularly designed for multi-class classification problems. The proposed method works in two steps. In the first step, we extend the Fisher discriminant idea to determine a more appropriate kernel function and map the input data with all classes into a feature space. In the second step, the ODD features are extracted based on the one-vs-all scheme to generate discriminative features between a pattern and each hyperplane. These newly extracted features are treated as the representative features and are further used in the subsequent classification procedure. Substantial experiments on both UCI and real-world datasets have been conducted to investigate the performance of ODD feature-based multi-class classification. The statistical results show that the classification accuracy based on ODD features outperforms that of state-of-the-art feature extraction methods.
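The general one-vs-all idea of turning signed distances to class hyperplanes into new features can be sketched as follows (a simplified stand-in for the ODD method, using scikit-learn's `LinearSVC`; the synthetic data and parameters are invented for the example):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Sketch of the general one-vs-all distance-to-hyperplane idea (not
# the authors' exact ODD method): train a linear classifier per class
# and use each sample's signed distance to every class hyperplane as
# its new, discriminative feature vector.

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 3, 6)])  # 3 blobs
y = np.repeat([0, 1, 2], 30)

clf = LinearSVC(dual=False).fit(X, y)   # one-vs-rest by default
features = clf.decision_function(X)     # one signed distance per class
print(features.shape)                   # (90, 3): 3 hyperplane distances
```

The `(n_samples, n_classes)` matrix of distances then feeds the downstream classifier; the paper's first step additionally tunes the kernel via a Fisher-discriminant criterion before these distances are computed.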
Liu, B, Xiao, Y, Cao, L & Yu, P 2010, 'Vote-Based LELC for Positive and Unlabeled Textual Data Streams', 2010 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE International Conference on Data Mining, IEEE Computer Society Conference Publishing Services (CPS), Sydney, NSW, Australia, pp. 951-958.
In this paper, we extend the LELC (PU Learning by Extracting Likely Positive and Negative Micro-Clusters) method to cope with positive and unlabeled data streams. Our developed approach, called vote-based LELC, works in three steps. In the first step, we extract representative documents from unlabeled data and assign a vote score to each document. The assigned vote score reflects the degree to which an example belongs to its corresponding class. In the second step, the extracted representative examples, together with their vote scores, are incorporated into a learning phase to build an SVM-based classifier. In the third step, we propose the use of an ensemble classifier to cope with the concept drift involved in the textual data stream environment. Our approach aims to improve the performance of LELC by letting examples contribute differently to the construction of the classifier according to their vote scores. Extensive experiments on textual data streams have demonstrated that vote-based LELC outperforms the original LELC method.
Liu, B, Yin, J, Xiao, Y, Cao, L & Yu, P 2010, 'Exploiting Local Data Uncertainty to Boost Global Outlier Detection', ICDM 2010, The 10th IEEE International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE, Sydney, pp. 304-313.
This paper presents a novel hybrid approach to outlier detection by incorporating local data uncertainty into the construction of a global classifier. To deal with local data uncertainty, we assign a confidence value to each example in the training data, which measures the strength of the corresponding class label. Our proposed method works in two steps. Firstly, we generate a pseudo training dataset by computing a confidence value for each input example on its class label. We present two different mechanisms: a kernel k-means clustering algorithm and a kernel LOF-based algorithm, to compute the confidence values based on the local data behavior. Secondly, we construct a global classifier for outlier detection by generalizing the SVDD-based learning framework to incorporate both positive and negative examples as well as their associated confidence values. By integrating local and global outlier detection, our proposed method explicitly handles the uncertainty of the input data and reduces the sensitivity of SVDD to noise. Extensive experiments on real-life datasets demonstrate that our proposed method can achieve a better tradeoff between detection rate and false alarm rate as compared to four state-of-the-art outlier detection algorithms.
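A toy illustration of the first step, assigning each training example a confidence value from its local behaviour, using plain k-NN distances as a simple stand-in for the paper's kernel LOF and kernel k-means mechanisms (all names here are illustrative assumptions):

```python
import math

def knn_confidence(points, k=2):
    """Confidence of each point = inverse of its mean distance to its
    k nearest neighbours, normalised so the densest point scores 1.0."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    raw = []
    for i, p in enumerate(points):
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        raw.append(1.0 / (1.0 + sum(ds[:k]) / k))
    top = max(raw)
    return [r / top for r in raw]

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]  # last point is isolated
print(knn_confidence(pts))  # the isolated point gets the smallest confidence
```

Examples with low confidence contribute less to the global SVDD-style classifier, which is the effect the paper exploits.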
Agent-based workflow has proven its potential in overcoming issues in traditional workflow-based systems, such as decentralization and organizational issues. Existing data mining tools provide a workflow metaphor for data mining process visualization, audition and monitoring, which is particularly useful in distributed environments. In agent-based distributed data mining (ADDM), agents are an integral part of the system and can seamlessly incorporate workflows. We describe a mechanism that uses workflows in both descriptive and executable styles to mediate between workflow generators and executors. This paper shows that agent-based workflows can improve ADDM interoperability and flexibility; to support the argument, a multi-agent architecture and an agent-based workflow model are demonstrated.
Moemeng, C, Zhu, X, Cao, L & Chen, J 2010, 'i-Analyst: An Agent-Based Distributed Data Mining Platform', Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, IEEE International Conference on Data Mining, IEEE, Sydney, NSW, pp. 1404-1406.
User-friendliness and performance are important properties of data mining and analysis tools. In this demo, we introduce an agent-based distributed data mining platform that allows users to manage and share data-mining-related resources conveniently. Furthermore, the platform employs agents for workflow enactment, where performance is improved by agent abilities. We also present an example to illustrate how the platform works in a distributed environment. Its performance is competitive with a non-agent approach when data is highly distributed and large.
Xiao, Y, Liu, B & Cao, L 2010, 'K-Farthest-Neighbors-Based Concept Boundary Determination for Support Vector Data Description', Proceedings of the 19th ACM International Conference on Information and Knowledge Management & Co-Located Workshops, ACM International Conference on Information and Knowledge Management, ACM, Toronto, Ontario, Canada, pp. 1701-1704.
Support vector data description (SVDD) is very useful for one-class classification. However, it incurs high time complexity in handling large-scale data. In this paper, we propose a novel and efficient method, named K-Farthest-Neighbors-based Concept Boundary Detection (KFN-CBD for short), to improve SVDD learning efficiency on large datasets. This work is motivated by the observation that the SVDD classifier is determined by support vectors (SVs), and removing the non-support vectors (non-SVs) will not change the classifier but will reduce computational costs. Our approach consists of two steps. In the first step, we propose the K-farthest-neighbors method to identify the samples around the hyper-sphere surface, which are more likely to be SVs. At the same time, a new tree search strategy for the M-tree is presented to speed up the K-farthest-neighbor query. In the second step, the non-SVs are eliminated from the training set, and only the identified boundary samples are used to train the SVDD classifier. By removing the non-SVs, the training time of SVDD can be substantially reduced. Extensive experiments have shown that KFN-CBD achieves around 6 times speedup compared to the standard SVDD, and obtains classification quality comparable to that of training on the entire dataset.
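The intuition behind the first step, points whose farthest neighbours are unusually far tend to sit near the data boundary, can be sketched with a brute-force scorer (a naive O(n²) stand-in for the paper's M-tree-accelerated query; names are illustrative):

```python
import math

def boundary_scores(points, k=2):
    """Score each point by its mean distance to its K *farthest*
    neighbours; points near the data boundary score highest."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    scores = []
    for i, p in enumerate(points):
        ds = sorted((dist(p, q) for j, q in enumerate(points) if j != i),
                    reverse=True)
        scores.append(sum(ds[:k]) / k)
    return scores

# The interior point (0.5, 0.5) scores lower than the four corners.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
s = boundary_scores(pts)
print(s.index(min(s)))  # 4 -> the interior point
```

Keeping only the top-scoring fraction of points as SVDD training input is the pruning idea the paper formalises.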
Xiao, Y, Liu, B, Cao, L, Yin, J & Wu, X 2010, 'SMILE: A Similarity-Based Approach for Multiple Instance Learning', 2010 IEEE 10th International Conference on Data Mining (ICDM), IEEE International Conference on Data Mining, IEEE Computer Society Conference Publishing Services (CPS), Sydney, NSW, Australia, pp. 589-598.
Multiple instance learning (MIL) is a generalization of supervised learning which attempts to learn useful information from bags of instances. In MIL, the true labels of the instances in positive bags are not always available for training. This leads to a critical challenge, namely, handling the ambiguity of instance labels in positive bags. To address this issue, this paper proposes a novel MIL method named SMILE (Similarity-based Multiple Instance LEarning). It introduces a similarity weight for each instance in positive bags, which represents the instance's similarity towards the positive and negative classes. The instances in positive bags, together with their similarity weights, are thereafter incorporated into the learning phase to build an extended SVM-based predictive classifier. Experiments on three real-world datasets consisting of 12 subsets show that SMILE achieves markedly better classification accuracy than state-of-the-art MIL methods.
Yang, T, Cao, L & Zhang, C 2010, 'A Novel Prototype Reduction Method for the K-Nearest Neighbor Algorithm with K >= 1', Advances in Knowledge Discovery and Data Mining - Lecture Notes in Artificial Intelligence, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer Berlin / Heidelberg, Hyderabad, India, pp. 89-100.
In this paper, a novel prototype reduction algorithm is proposed, which aims at reducing the storage requirement and enhancing the online speed while retaining the same level of accuracy for a K-nearest neighbor (KNN) classifier. To achieve this goal, our proposed algorithm learns the weighted similarity function for a KNN classifier by maximizing the leave-one-out cross-validation accuracy. Unlike the classical methods PW, LPD and WDNN, which can only work with K=1, our developed algorithm can work with K>=1. This flexibility allows our learning algorithm to have superior classification accuracy and noise robustness. The proposed approach is assessed through experiments with twenty real-world benchmark data sets. In all these experiments, the proposed approach dramatically reduces the storage requirement and online time for KNN while having equal or better accuracy than KNN, and it also shows comparable results to several prototype reduction methods proposed in the literature.
Yang, T, Kecman, V, Cao, L & Zhang, C 2010, 'Combining Support Vector Machines and the t-statistic for Gene Selection in DNA Microarray Data Analysis', Advances in Knowledge Discovery and Data Mining - Lecture Notes in Artificial Intelligence, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer Berlin / Heidelberg, Hyderabad, India, pp. 55-62.
This paper proposes a new gene selection (or feature selection) method for DNA microarray data analysis. In the method, the t-statistic and support vector machines are combined efficiently. The resulting gene selection method uses both the data's intrinsic information and learning algorithm performance to measure the relevance of a gene in a DNA microarray. We explain why and how the proposed method works well. The experimental results on two benchmark microarray data sets show that the proposed method is competitive with previous methods. The proposed method can also be used for other feature selection problems.
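The "data intrinsic information" half of the filter can be sketched as a per-gene t-statistic ranking (using the unequal-variance Welch form as an assumption; the SVM-performance half of the method is omitted, and all names are illustrative):

```python
import math

def t_statistic(class1, class2):
    """Welch t-statistic for one gene's expression values in two classes."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    return abs(mean(class1) - mean(class2)) / math.sqrt(
        var(class1) / len(class1) + var(class2) / len(class2))

# Rank genes by |t|; a higher value means better class separation.
genes = {
    "g1": ([5.0, 5.1, 4.9], [1.0, 1.2, 0.8]),   # well separated
    "g2": ([2.0, 3.0, 2.5], [2.1, 2.9, 2.6]),   # overlapping
}
ranked = sorted(genes, key=lambda g: t_statistic(*genes[g]), reverse=True)
print(ranked)  # ['g1', 'g2']
```

In the combined method, such a statistical ranking is then refined by how much each gene actually helps the learning algorithm.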
Yang, T, Kecman, V, Cao, L & Zhang, C 2010, 'Testing Adaptive Local Hyperplane for multi-class classification by double cross-validation', The 2010 International Joint Conference on Neural Networks (IJCNN), International Joint Conference on Neural Networks, IEEE, Barcelona, Spain, pp. 1-5.
Adaptive Local Hyperplane (ALH) is a recently proposed classifier for multi-class classification problems and it has shown encouraging performance in many pattern recognition problems. However, ALH's performance over many general classification datasets has only been tested by using a single loop of the cross-validation procedure, where the whole dataset is used for both hyper-parameter determination and accuracy estimation. This procedure is appropriate for classifier performance comparison, but the produced results are likely to be optimistic for classifier accuracy estimation on new datasets. In this paper, we test the performance of ALH as well as several other benchmark classifiers by using a two-loop cross-validation (a.k.a. double resampling) procedure, where the inner loop is used for hyper-parameter determination and the outer loop is used for accuracy estimation. With such a testing scheme, the classification accuracy of a tested classifier can be evaluated in a stricter way. The experimental results indicate the superior performance of the ALH classifier with respect to traditional classifiers including Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Linear Discriminant Analysis (LDA), Classification Tree (Tree) and K-local Hyperplane distance Nearest Neighbor (HKNN). These results imply that the ALH classifier might become a useful tool for pattern recognition tasks.
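The double (nested) cross-validation scheme described above is classifier-agnostic and can be sketched generically; `fit`, `score`, and the candidate parameter grid below are stand-ins, not the paper's actual classifiers:

```python
def k_fold_indices(n, k):
    """Split range(n) into k (train, test) index-list pairs."""
    folds = []
    for f in range(k):
        test = [i for i in range(n) if i % k == f]
        train = [i for i in range(n) if i % k != f]
        folds.append((train, test))
    return folds

def nested_cv(data, params, fit, score, outer_k=3, inner_k=2):
    """Outer loop estimates accuracy; inner loop picks hyper-parameters,
    so the accuracy estimate never uses test data for tuning."""
    outer_scores = []
    for train_idx, test_idx in k_fold_indices(len(data), outer_k):
        train = [data[i] for i in train_idx]
        # Inner loop: choose the best parameter on the training part only.
        def inner_score(p):
            return sum(score(fit([train[i] for i in tr], p),
                             [train[i] for i in te])
                       for tr, te in k_fold_indices(len(train), inner_k))
        best = max(params, key=inner_score)
        model = fit(train, best)
        outer_scores.append(score(model, [data[i] for i in test_idx]))
    return sum(outer_scores) / outer_k
```

With a toy threshold "classifier" (the model is just a threshold p, predicting x > p) on data labelled by x > 5, the inner loop selects p = 5 and the outer estimate is 1.0.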
Yang, Y, Cao, L & Liu, L 2010, 'Time-Sensitive Feature Mining for Temporal Sequence Classification', Lecture Notes in Artificial Intelligence 6230 - PRICAI 2010: Trends in Artificial Intelligence, Pacific Rim International Conference on Artificial Intelligence, Springer-Verlag Berlin Heidelberg, Daegu, Korea, pp. 315-326.
Behavior analysis has received much attention in recent years in areas such as customer-relationship management, social security surveillance and e-business. Discovering high-impact behavior patterns is important for detecting and preventing their occurrences and reducing the resulting risks and losses to our society. In the data mining community, researchers have paid little attention to time-stamps in temporal behavior sequences during classification, without explicitly considering the inherent temporal information. In this paper, we propose a novel Temporal Feature Extraction Method - TFEM. It extracts sequential pattern features where each transition is annotated with a typical transition time (its duration or interval). It therefore substantially enriches the temporal characteristics derived from temporal sequences, yielding performance improvements, as demonstrated by a set of experiments performed on synthetic and real-world datasets. In addition, TFEM has the merit of simplicity in implementation, and its pattern-based architecture can generate human-readable results and supply clear interpretability to users. Meanwhile, it is adjustable and adaptive to users' different configurations, allowing a tradeoff between classification accuracy and time cost.
Zhao, Y, Bohlscheid, H, Wu, S & Cao, L 2010, 'Less Effort, More Outcomes: Optimising Debt Recovery with Decision Trees', Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, IEEE International Conference on Data Mining, IEEE Computer Society Conference Publishing Services (CPS), Sydney, NSW, Australia, pp. 655-660.
This paper presents a real-world application of data mining techniques to optimise debt recovery in social security. The traditional method of contacting a customer for the purpose of putting in place a debt recovery schedule has been an out-bound phone call, and by and large, customers are chosen at random. This obsolete and inefficient method of selecting customers for debt recovery purposes has existed for years; in order to improve this process, decision trees were built to model debt recovery and predict the response of customers if contacted by phone. Test results on historical data show that the built model is effective in ranking customers by their likelihood of entering into a successful debt recovery repayment schedule. By contacting only the top 20 per cent of customers in debt, instead of all of them, approximately 50 per cent of repayments would still be received.
Zheng, Z, Zhao, Y, Zuo, Z & Cao, L 2010, 'An Efficient GA-Based Algorithm for Mining Negative Sequential Patterns', Advances in Knowledge Discovery and Data Mining - Lecture Notes in Artificial Intelligence, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer Berlin / Heidelberg, Hyderabad, India, pp. 262-273.
Negative sequential pattern mining has attracted increasing attention in recent data mining research because it considers negative relationships between itemsets, which are ignored by positive sequential pattern mining. However, the search space for mining negative patterns is much bigger than that for positive ones. When the support threshold is low, in particular, there will be huge numbers of negative candidates. This paper proposes a Genetic Algorithm (GA) based algorithm to find negative sequential patterns with novel crossover and mutation operations, which are efficient at passing good genes on to next generations without generating candidates. An effective dynamic fitness function and a pruning method are also provided to improve performance. The results of extensive experiments show that the proposed method can find negative patterns efficiently and has remarkable performance compared with other algorithms for negative pattern mining.
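The GA machinery itself, evolving candidate sequences through crossover, mutation, and fitness-based selection, can be sketched generically. This is an assumption-laden toy, not the paper's algorithm: the fitness function, selection scheme, and operators here are generic placeholders, and the paper's dynamic fitness and pruning are omitted:

```python
import random

def crossover(a, b):
    """Single-point crossover of two equal-length candidate sequences."""
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(seq, alphabet, rate=0.2):
    """Replace each item with a random alphabet symbol at the given rate."""
    return [random.choice(alphabet) if random.random() < rate else s
            for s in seq]

def evolve(population, fitness, alphabet, generations=30):
    """Keep the fitter half each generation; breed the rest from it."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: len(population) // 2]
        children = []
        while len(survivors) + len(children) < len(population):
            p1, p2 = random.sample(survivors, 2)
            c1, c2 = crossover(p1, p2)
            children += [mutate(c1, alphabet), mutate(c2, alphabet)]
        population = survivors + children[: len(population) - len(survivors)]
    return max(population, key=fitness)
```

In the paper's setting, the fitness of a candidate would reflect its support as a negative sequential pattern rather than the toy objective used here.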
Cao, L, Ou, Y, Yu, P & Wei, G 2010, 'Detecting Abnormal Coupled Sequences and Sequence Changes in Group-based Manipulative Trading Behaviors', Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data, ACM SIGKDD International Conference on Knowledge Discovery and Data, ACM, Washington DC, DC, USA, pp. 85-93.
In capital market surveillance, an emerging trend is that a group of hidden manipulators collaborate with each other to manipulate three trading sequences: buy-orders, sell-orders and trades, through carefully arranging their prices, volumes and time, in order to mislead other investors, affect the instrument movement, and thus maximize personal benefits. If the focus is on only one of the above three sequences in attempting to analyze such hidden group-based behavior, or if they are merged into one sequence per investor, the coupling relationships among them indicated through trading actions and their prices/volumes/times would be missing, and the resulting findings would have a high probability of misrepresenting the genuine business facts. Therefore, typical sequence analysis approaches, which mainly identify patterns on a single sequence, cannot be used here. This paper addresses a novel topic, namely coupled behavior analysis in hidden groups. In particular, we propose a coupled Hidden Markov Model (HMM)-based approach to detect abnormal group-based trading behaviors. The resulting models cater for (1) multiple sequences from a group of people, (2) interactions among them, (3) sequence item properties, and (4) significant change among coupled sequences. We demonstrate our approach in detecting abnormal manipulative trading behaviors on orderbook-level stock data. The results are evaluated against alerts generated by the exchange's surveillance system from both technical and computational perspectives. They show that the proposed coupled and adaptive HMMs outperform a standard HMM modeling only a single sequence, as well as an HMM combining multiple single sequences without considering the coupling relationship. Further work on coupled behavior analysis, including coupled sequence/event analysis, hidden group analysis and behavior dynamics, is critical.
Cao, L 2009, 'Data Mining in Financial Markets', Advanced Data Mining and Applications, 5th International Conference, ADMA 2009, International Conference on Advanced Data Mining and Applications, Springer, Budapest, Hungary, pp. 4-4.
Cao, L, Luo, D & Zhang, C 2009, 'Ubiquitous Intelligence in Agent Mining', ADMI 2009, International Workshop on Agents and Data Mining Interaction, Springer, Budapest, Hungary, pp. 23-35.
Agent mining, namely the interaction and integration of multi-agent systems and data mining, has emerged as a very promising research area. While many mutual issues exist in both the multi-agent and data mining areas, most of them can be described in terms of, or related to, ubiquitous intelligence. It is certainly very important to define, specify, represent, analyze and utilize ubiquitous intelligence in agents, data mining, and agent mining. This paper presents a novel but preliminary investigation of ubiquitous intelligence in these areas. We specify six types of ubiquitous intelligence: data intelligence, human intelligence, domain intelligence, network and web intelligence, organizational intelligence, and social intelligence. We define and illustrate them, and discuss techniques for incorporating them into agents, data mining, and agent mining for complex problem-solving. Further investigation into incorporating and synthesizing ubiquitous intelligence into agents, data mining, and agent mining will lead to a disciplinary upgrade from methodological, technical and practical perspectives.
Goh, T-T, Bose, S, Ng, WK, Cao, L & Lee, VCS 2009, 'Editorial to the Proceedings of Mobile Technologies in Enterprise Computing Systems Workshop (MTECS 2009)', 2009 13th Enterprise Distributed Object Computing Conference Workshops (EDOCW 2009), 13th IEEE International Enterprise Distributed Object Computing Conference (EDOC 2009), IEEE, Auckland, New Zealand, pp. 138+.
Tsai, PC, Tran, TP & Cao, L 2009, 'Expression-invariant Facial Identification', Proceedings 2009 IEEE International Conference on Systems, Man and Cybernetics, IEEE Conference on Systems, Man and Cybernetics, IEEE, San Antonio, Texas, USA, pp. 5151-5155.
Facial identification has been recognized as a simple and non-intrusive technology that can be applied in many places. However, many facial identification problems remain unsolved due to intra-personal variations. In particular, when images in the database appear with different facial expressions, most currently available facial recognition approaches encounter the expression-invariance problem, in which neutral faces are difficult to recognize. In this paper, a new approach is proposed to transform facial expressions into neutral-face-like images, hence enabling image retrieval systems to robustly identify a person's face even when the training and testing face images differ in facial expression.
Xiao, Y, Liu, B, Cao, L, Wu, X, Zhang, C, Hao, Z, Yang, F & Cao, J 2009, 'Multi-sphere Support Vector Data for Outliers Detection on Multi-distribution Data', Data Mining Workshops, 2009. ICDMW '09. IEEE International Conference on, IEEE International Conference on Data Mining, IEEE Computer Society Press, Miami, Florida, pp. 82-87.
SVDD has been proven a powerful tool for outlier detection. However, in detecting outliers on multi-distribution data, that is, data containing several distinct distributions, it is very challenging for SVDD to generate a single hyper-sphere that distinguishes outliers from normal data. Even if such a hyper-sphere can be identified, its performance is usually not good enough. This paper proposes a multi-sphere SVDD approach, named MS-SVDD, for outlier detection on multi-distribution data. First, an adaptive sphere detection method is proposed to detect the data distributions in the dataset. The data is then partitioned in terms of the identified distributions, and the corresponding SVDD classifiers are constructed separately. Substantial experiments on both artificial and real-world datasets have demonstrated that the proposed approach outperforms the original SVDD.
Zhao, G, Xiong, Y, Cao, L, Luo, D, Su, X & Zhu, Y 2009, 'A Cost-Effective LSH Filter for Fast Pairwise Mining', ICDM 2009, The Ninth IEEE International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE Computer Society, Miami, Florida, USA, pp. 1088-1093.
The pairwise mining problem is to discover pairs of objects having measures greater than a user-specified minimum threshold from a collection of objects. It is essential in a large variety of database and data mining applications. Of late, there has been increasing interest in applying a Locality-Sensitive Hashing (LSH) scheme to pairwise mining. LSH-type methods have shown themselves to be simple to implement and capable of achieving significant performance gains in running time over most exact methods. However, present LSH-type methods still suffer from some bottlenecks, such as the curse of the threshold. In this paper, we propose a novel LSH-based method, namely the Cost-effective LSH filter (Ce-LSH for short), for pairwise mining. Compared with previous LSH-type methods, it uses a lower fixed number of LSH functions and is thus more cost-effective. Substantial experiments show that our method gives significant improvements in running time over existing LSH-type methods and a recently reported upper-bound-based method. Experimental results also indicate that it scales well even for a relatively low minimum threshold and for a fairly small miss ratio.
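The basic LSH property the paper builds on, that similar objects collide on the same hash bits with high probability, can be illustrated with sign-of-projection signatures for cosine similarity. This is a generic illustration, not Ce-LSH itself; the hyperplanes below are fixed by hand to keep the example deterministic (in practice they are random):

```python
def signature(vec, hyperplanes):
    """One sign bit per hyperplane: vectors on the same side share the bit."""
    return tuple(int(sum(h * v for h, v in zip(hp, vec)) >= 0)
                 for hp in hyperplanes)

# Fixed hyperplanes for a deterministic illustration (normally random).
planes = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0],
          [1, 0, 1], [0, 1, 1], [1, -1, 0], [1, 1, 1]]

a = [1.0, 0.9, 1.1]
b = [1.1, 1.0, 0.9]    # nearly parallel to a
c = [-1.0, -1.0, 1.0]  # points in a very different direction

agree = lambda u, v: sum(x == y for x, y in
                         zip(signature(u, planes), signature(v, planes)))
print(agree(a, b), agree(a, c))  # 8 4 -> the similar pair shares more bits
```

Pairwise-mining filters exploit this by only comparing pairs whose signatures collide, skipping most dissimilar pairs.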
Zhao, Y, Zhang, H, Cao, L, Zhang, C & Bohlscheid, H 2009, 'Mining Both Positive and Negative Impact-Oriented Sequential Rules from Transactional Data', Advances in Knowledge Discovery and Data Mining, 13th Pacific-Asia Conference, PAKDD 2009, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Bangkok, Thailand, pp. 656-663.
Traditional sequential pattern mining deals with positive correlation between sequential patterns only, without considering negative relationship between them. In this paper, we present a notion of impact-oriented negative sequential rules, in which the left side is a positive sequential pattern or its negation, and the right side is a predefined outcome or its negation. Impact-oriented negative sequential rules are formally defined to show the impact of sequential patterns on the outcome, and an efficient algorithm is designed to discover both positive and negative impact-oriented sequential rules. Experimental results on both synthetic data and real-life data show the efficiency and effectiveness of the proposed technique.
Zhao, Y, Zhang, H, Wu, S, Pei, J, Cao, L, Zhang, C & Bohlscheid, H 2009, 'Debt Detection in Social Security by Sequence Classification Using Both Positive and Negative Patterns', Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2009, European Conference on Machine Learning, Springer, Bled, Slovenia, pp. 648-663.
Debt detection is important for improving payment accuracy in social security. Since debt detection from customer transactional data can be generally modelled as a fraud detection problem, a straightforward solution is to extract features from transaction sequences and build a sequence classifier for debts. The existing sequence classification methods based on sequential patterns consider only positive patterns. However, according to our experience in a large social security application, negative patterns are very useful in accurate debt detection. In this paper, we present a successful case study of debt detection in a large social security application. The central technique is building sequence classification using both positive and negative sequential patterns.
Zheng, Z, Zhao, Y, Zuo, Z & Cao, L 2009, 'Negative-GSP: An Efficient Method for Mining Negative Sequential Patterns', Proceedings of the 8th Australasian Data Mining Conference (AusDM'09): Data Mining and Analytics - Conferences in Research and Practice in Information Technology Volume 101, Australian Data Mining Conference, Australian Computer Society, Melbourne, Australia, pp. 63-67.
Different from traditional positive sequential pattern mining, negative sequential pattern mining considers both positive and negative relationships between items. Negative sequential pattern mining does not necessarily follow the Apriori principle, and its search space is much larger than that of positive pattern mining. After giving definitions and some constraints of negative sequential patterns, this paper proposes a new method for mining negative sequential patterns, called Negative-GSP. Negative-GSP finds negative sequential patterns effectively and efficiently by joining and pruning, and extensive experimental results show the efficiency of the method.
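A minimal notion of support for such patterns can be sketched as follows: the positive items must occur in order, while negated items must not occur. Note this simplified containment rule is an assumption for illustration; the paper's definitions and constraints are richer, and all names below are invented:

```python
def supports(sequence, pattern):
    """pattern: list of (item, negated) pairs. Positive items must occur
    in order; negated items must not occur anywhere (simplified rule)."""
    pos = [it for it, neg in pattern if not neg]
    negs = {it for it, neg in pattern if neg}
    if any(it in sequence for it in negs):
        return False
    i = 0
    for it in sequence:
        if i < len(pos) and it == pos[i]:
            i += 1
    return i == len(pos)

def support_count(db, pattern):
    """Number of sequences in the database supporting the pattern."""
    return sum(supports(s, pattern) for s in db)

db = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
# Pattern: a followed by c, with b never occurring.
print(support_count(db, [("a", False), ("b", True), ("c", False)]))  # 1
```

A GSP-style miner then grows candidate patterns and keeps those whose support count exceeds the threshold, pruning by joining as Negative-GSP does.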
Cao, L 2008, 'Behavior Informatics and Analytics: Let Behavior Talk', Proceedings of the 2008 IEEE International Conference on Data Mining Workshops, IEEE International Conference on Data Mining, IEEE Computer Society, Pisa, Italy, pp. 87-96.
Behavior is increasingly recognized as a key component in business intelligence and problem-solving. Different from traditional behavior analysis, which mainly focuses on implicit behavior and explicit business appearance as a result of business usage and customer demographics, this paper proposes the field of Behavior Informatics and Analytics (BIA), to support explicit behavior involvement through a conversion from transactional data to behavioral data, and further genuine analysis of native behavior patterns and impacts. BIA consists of key components including behavior modeling and representation, behavioral data construction, behavior impact modeling, behavior pattern analysis, and behavior presentation. BIA can greatly complement existing means with combined, more informative and social patterns and solutions for critical problem-solving in areas such as customer-officer interaction, counter-terrorism and the monitoring of online communities.
Cao, L 2008, 'Domain Driven Data Mining (D3M)', Proceedings of the 2008 IEEE International Conference on Data Mining Workshops, IEEE International Conference on Data Mining, IEEE Computer Society, Pisa, Italy, pp. 74-76.
In deploying data mining in real-world business, we have to cater for business scenarios, organizational factors, user preferences and business needs. However, current data mining algorithms and tools often stop at the delivery of patterns satisfying expected technical interestingness. Business people are not informed about how and what to do to take over the technical deliverables. The gap between academia and business has seriously affected the widespread employment of advanced data mining techniques in greatly promoting enterprise operational quality and productivity. To narrow this gap, cater for real-world factors relevant to data mining, and make data mining workable in supporting decision-making actions in the real world, we propose the methodology of Domain Driven Data Mining (D3M for short). D3M aims to construct next-generation methodologies, techniques and tools for a possible paradigm shift from data-centered hidden pattern mining to domain-driven actionable knowledge delivery. In this talk, we address the concept map of D3M, theoretical underpinnings, several general and flexible frameworks, research issues, possible directions, and application areas related to D3M. Real-world case studies in financial data mining and social security mining are demonstrated to show the effectiveness and applicability of D3M in both research and development on real-world challenging problems.
Cao, L 2008, 'Metasynthetic Computing for Solving Open Complex Problems', Proceedings of the 2008 32nd Annual IEEE International Computer Software and Applications Conference, International Computer Software and Applications Conference, IEEE Computer Society, Turku, Finland, pp. 896-901.
Complex systems, in particular open complex giant systems, have become one of the major challenges to many current disciplines such as system sciences, cognitive sciences, intelligence sciences, computer sciences, and information sciences. An appropriate methodology for dealing with them is the theory of qualitative-to-quantitative metasynthesis. From the perspective of engineering, we propose the concept of metasynthetic computing. This paper discusses the theoretical framework, problem-solving process and intelligence emergence of metasynthetic computing from both engineering and cognition perspectives. These efforts can help one understand complex systems and design effective problem-solving systems.
Cao, L, Luo, D, Xiao, Y & Zheng, Z 2008, 'Agent Collaboration for Multiple Trading Strategy Integration', Lecture Notes in Artificial Intelligence Vol 4953: Agent and Multi-Agent Systems: Technologies and Applications, International KES Symposium on Agents and Multiagent systems - Technologies and Applications, Springer Berlin, Incheon, Korea, pp. 361-370.
The collaboration of agents can undertake complicated tasks that cannot be handled well by a single agent. This is especially true for executing multiple goals at the same time. In this paper, we demonstrate the use of trading agent collaboration in integrating multiple trading strategies. Trading agents are used for developing quality trading strategies to support smart actions in the market. Evolutionary trading agents are armed with evolutionary computing capability to optimize strategy parameters. To develop even smarter trading strategies (which we call golden strategies), multiple evolutionary and collaborative trading agents negotiate with each other for m loops to search multiple local strategies with the best parameter combinations. They also integrate multiple classes of strategies for trading agents to achieve the best global objectives acceptable for trader needs. Tests of five classes of trading strategies on ten years of data from five markets have shown that agent collaboration for strategy integration can achieve much better trading performance than either individually optimized or randomly chosen strategies.
Liu, B, Cao, L, Yu, P & Zhang, C 2008, 'Multi-Space-Mapped SVMs for Multi-Class Classification', 2008 Eighth IEEE International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE, Pisa, Italy, pp. 911-916.
In SVM-based multi-class classification, it is not always possible to find an appropriate kernel function to map all the classes, drawn from different distribution functions, into a feature space where they are linearly separable from each other. This is even worse if the number of classes is very large. As a result, the classification accuracy is not as good as expected. In order to improve the performance of SVM-based multi-classifiers, this paper proposes a method, named multi-space-mapped SVMs, to map the classes into different feature spaces and then classify them. The proposed method reduces the requirements on the kernel function. Substantial experiments have been conducted on the One-against-All, One-against-One, FSVM and DDAG algorithms and our algorithm using six UCI data sets. The statistical results show that the proposed method has a higher probability of finding appropriate kernel functions than traditional methods and outperforms them.
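The core idea of mapping different classes into different feature spaces can be sketched with a one-vs-rest scheme in which each binary classifier is free to use its own kernel. The sketch below is not the authors' algorithm: it substitutes a kernel perceptron for the SVM solver, and the kernels and data are purely illustrative.

```python
import numpy as np

def affine_k(X, Z):
    """Linear kernel with an implicit bias term (feature map (x, 1))."""
    return X @ Z.T + 1.0

def rbf_k(X, Z, gamma=0.5):
    """Gaussian RBF kernel."""
    d = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

class PerClassKernelOvR:
    """One-vs-rest with a possibly different kernel per class, so each class
    is separated from the rest in its own feature space."""
    def __init__(self, kernels):
        self.kernels = kernels              # class label -> kernel function

    def fit(self, X, y, epochs=20):
        self.X = X
        self.duals = {}
        for c, kern in self.kernels.items():
            t = np.where(y == c, 1.0, -1.0)
            K = kern(X, X)
            a = np.zeros(len(X))            # dual coefficients (kernel perceptron)
            for _ in range(epochs):
                for i in range(len(X)):
                    if ((a * t) @ K[:, i]) * t[i] <= 0:
                        a[i] += 1.0         # perceptron update on a mistake
            self.duals[c] = (a, t)
        return self

    def predict(self, Z):
        labels = sorted(self.kernels)
        scores = [(self.duals[c][0] * self.duals[c][1]) @ self.kernels[c](self.X, Z)
                  for c in labels]
        return np.array(labels)[np.argmax(np.stack(scores), axis=0)]

# Three well-separated synthetic blobs; class 2 gets an RBF feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (20, 2)) for m in ([0, 0], [4, 0], [2, 4])])
y = np.repeat([0, 1, 2], 20)
clf = PerClassKernelOvR({0: affine_k, 1: affine_k, 2: rbf_k}).fit(X, y)
```

Each binary problem only needs a kernel that separates *its* class from the rest, which is the relaxation the paper exploits.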
Moemeng, C, Cao, L & Zhang, C 2008, 'F-TRADE 3.0: An Agent-Based Integrated Framework for Data Mining Experiments', 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Workshops, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society, University of Technology, Sydney, Australia, pp. 612-615.
Data mining research focuses on algorithms that mine valuable patterns from particular domains. Beyond the theoretical work, experiments take a vast amount of effort to build. In this paper, we propose an integrated framework that utilises a multi-agent system to help researchers rapidly develop experiments. Moreover, the proposed framework allows extension and integration for future research on the mutual aspects of agents and data mining. The paper describes the details of the framework and also presents a sample implementation.
Qiu, X, Jiang, S, Liu, H, Huang, Q & Cao, L 2008, 'Spatial-temporal attention analysis for home video', 2008 IEEE International Conference on Multimedia & Expo, IEEE International Conference on Multimedia and Expo, IEEE, Hannover Congress Centrum, Hannover, Germany, pp. 1517-1520.
In this paper, by considering the multiple spatial-temporal characteristics of the visual perception system, we propose a novel home video attention analysis method. Firstly, each frame of the video is segmented into regions, which are more informative than pixels and image blocks. Then the saliency of each region is analyzed by combining static, motion and location attention. Finally, a region-based saliency map is generated for each frame, and an attention score curve is obtained for the video clip by combining the attention scores of all regions in each frame. Both can be utilized in a wide range of applications. This method takes advantage of the properties of human visual perception and can well represent the attention information of home videos. Experimental results show the effectiveness of this approach.
Xiao, Y, Liu, B, Luo, D & Cao, L 2008, 'Multi-agent system for custom relationship management with SVMs tool', 2nd KES International Symposium on Agent and Multi-Agent Systems: Technologies and Applications, KES-AMSTA 2008, International KES Symposium on Agents and Multiagent Systems - Technologies and Applications, Springer Berlin / Heidelberg, Incheon, pp. 333-340.
Distributed data mining in CRM learns available knowledge from customer relationships so as to guide strategic behavior. To address CRM via distributed data mining, this paper proposes an architecture of distributed data mining for CRM, and then utilizes the support vector machine tool to separate customers into several classes and manage them. In the end, practical experiments on a Chinese company are conducted to show the good performance of the proposed approach. © 2008 Springer-Verlag Berlin Heidelberg.
Zhang, H, Zhao, Y, Cao, L & Zhang, C 2008, 'Combined Association Rule Mining', Lecture Notes in Artificial Intelligence Vol 5012: Advances in Knowledge Discovery and Data Mining, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Osaka, Japan, pp. 1069-1074.
This paper proposes an algorithm to discover a novel type of association rule, the combined association rule. Compared with conventional association rules, combined association rules allow users to perform actions directly. Combined association rules are organized as rule sets, each composed of a number of single combined association rules. These single rules consist of non-actionable attributes, actionable attributes, and a class attribute, with the rules in one set sharing the same non-actionable attributes. Thus, for a group of objects having the same non-actionable attributes, the actions corresponding to a preferred class can be performed directly. However, standard association rule mining algorithms encounter many difficulties when applied to combined association rule mining, and hence new algorithms have to be developed. In this paper, we focus on rule generation and interestingness measures in combined association rule mining. In rule generation, the frequent itemsets are discovered among itemset groups to improve efficiency. New interestingness measures are defined to discover more actionable knowledge. In the case study, the proposed algorithm is applied in the field of social security. The combined association rules provide much more actionable knowledge to business owners and users.
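The grouping step described above can be sketched as follows: single rules sharing the same non-actionable attributes are collected into one rule set, and the most confident action leading to a preferred class is then read off directly. The attribute names, rule fields and confidence values below are hypothetical, not taken from the paper's case study.

```python
from collections import defaultdict

# Hypothetical discovered rules: (non_actionable, actionable, class, confidence).
rules = [
    (("age=65+",), ("contact=phone",), "repaid", 0.8),
    (("age=65+",), ("contact=letter",), "default", 0.6),
    (("income=low",), ("contact=visit",), "repaid", 0.7),
]

def combine(rules):
    """Group single rules into combined rule sets keyed by their shared
    non-actionable attributes."""
    sets = defaultdict(list)
    for non_act, act, cls, conf in rules:
        sets[non_act].append((act, cls, conf))
    return dict(sets)

def recommend(rule_sets, non_act, preferred_class):
    """For objects with the given non-actionable attributes, pick the most
    confident action whose rule predicts the preferred class."""
    cands = [(conf, act) for act, cls, conf in rule_sets.get(non_act, [])
             if cls == preferred_class]
    return max(cands)[1] if cands else None

sets = combine(rules)
```

For a customer group with `("age=65+",)`, `recommend(sets, ("age=65+",), "repaid")` returns the action most likely to lead to repayment, which is the "directly performable action" the abstract refers to.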
Zhao, Y, Zhang, H, Cao, L, Zhang, C & Bohlscheid, H 2008, 'Combined Pattern Mining: from Learned Rules to Actionable Knowledge', AI 2008: Advances in Artificial Intelligence: Lecture Notes in Artificial Intelligence 5360, Australasian Joint Conference on Artificial Intelligence, Springer, Auckland, New Zealand, pp. 393-403.
Association mining often produces large collections of association rules that are difficult to understand and put into action. In this paper, we design a novel notion of combined patterns to extract useful and actionable knowledge from a large number of learned rules. We also present definitions of combined patterns, design novel metrics to measure their interestingness, and analyze the redundancy in combined patterns. Experimental results on real-life social security data demonstrate the effectiveness and potential of the proposed approach in extracting actionable knowledge from complex data.
Zhao, Y, Zhang, H, Cao, L, Zhang, C & Bohlscheid, H 2008, 'Efficient Mining of Event-Oriented Negative Sequential Rules', 2008 IEEE/WIC/ACM International Conference on Web Intelligence, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society, University of Technology, Sydney, Australia, pp. 336-342.
Traditional sequential pattern mining deals with positive sequential patterns only, that is, only frequent sequential patterns with the appearance of items are discovered. However, it is often interesting in many applications to find frequent sequential patterns with the non-occurrence of some items, which are referred to as negative sequential patterns. This paper analyzes three types of negative sequential rules and presents a new technique to find event-oriented negative sequential rules. Its effectiveness and efficiency are shown in our experiments.
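The notion of an event-oriented negative sequential rule, "A occurs and B does not occur, then target event E follows", can be illustrated by directly counting support and confidence over a small sequence database. This is a minimal sketch of the rule semantics, not the paper's efficient mining algorithm; the event names are illustrative.

```python
def rule_stats(sequences, occur, absent, target):
    """Support/confidence of the event-oriented rule:
    `occur` appears, `absent` does NOT appear, then `target` follows `occur`."""
    matched = fired = 0
    for seq in sequences:
        if occur in seq and absent not in seq:      # rule body holds
            matched += 1
            if target in seq[seq.index(occur) + 1:]:  # target after antecedent
                fired += 1
    support = matched / len(sequences)
    confidence = fired / matched if matched else 0.0
    return support, confidence

# Toy sequence database: each list is one customer/event sequence.
sequences = [list("abe"), list("ace"), list("abce"), list("ax"), list("ce")]
stats = rule_stats(sequences, "a", "b", "e")
```

Here two of five sequences satisfy "a occurs, b absent" (support 0.4), and in one of those two, "e" follows "a" (confidence 0.5); a real miner would enumerate candidate rules rather than score one.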
Luo, C, Zhao, Y, Cao, L, Ou, Y & Liu, L 2008, 'Outlier Mining on Multiple Time Series Data in Stock Market', PRICAI 2008: Trends in Artificial Intelligence, Pacific Rim International Conference on Artificial Intelligence, Springer, Hanoi, Vietnam, pp. 1010-1015.
In stock markets, the key surveillance function is identifying market anomalies, such as insider trading and market manipulation, to provide a fair and efficient trading platform [2,6]. Insider trading refers to trades on privileged information unavailable to the public. Market manipulation refers to trades or actions that aim to interfere with the demand or supply of a given stock to make the price increase or decrease in a particular way. Recently, new intelligent technologies have been required to deal with the challenges of the rapid increase of stock data. Outlier mining technologies have been used to detect market manipulation and insider trading. The objective of outlier mining is to find the data objects which are grossly different from or inconsistent with the majority of the data. However, in stock market data, outliers are highly intermixed with normal data, and it is difficult to judge whether an object is an outlier or not. Therefore, a more effective and more efficient approach is in demand. This paper presents a new technique for outlier detection on multiple time series data in the stock market. First, the principal curve algorithm is used to detect outliers in individual measurements of the stock market. Then, the generated outliers are measured with the probability of being real alerts. To improve accuracy and precision, these outliers are combined by rules associated with domain knowledge. The experimental results on real stock market data show that the proposed model is feasible in practice and achieves higher accuracy and precision than traditional methods.
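The first step, flagging points that deviate grossly from a smooth fit of the series, can be sketched as follows. As a simplifying assumption, a moving-average baseline stands in for the paper's principal-curve fit, and a robust residual threshold stands in for its alert-probability scoring.

```python
import numpy as np

def flag_outliers(series, window=5, k=3.0):
    """Flag points whose deviation from a smooth baseline (moving average as a
    stand-in for a principal curve) exceeds k robust standard deviations."""
    s = np.asarray(series, dtype=float)
    pad = window // 2
    padded = np.pad(s, pad, mode="edge")              # avoid edge shrinkage
    baseline = np.convolve(padded, np.ones(window) / window, mode="valid")
    resid = s - baseline
    mad = np.median(np.abs(resid - np.median(resid)))
    # MAD-based scale; fall back to the plain std-dev when MAD degenerates to 0
    scale = 1.4826 * mad if mad > 0 else resid.std() or 1.0
    return np.abs(resid) > k * scale

# A flat series with one injected spike: only the spike should be flagged.
series = np.ones(50)
series[25] = 10.0
flags = flag_outliers(series)
```

A real surveillance pipeline would run this per measurement (price, volume, ...) and then combine the flags with domain rules, as the abstract describes.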
Luo, C, Zhao, Y, Cao, L, Ou, Y & Zhang, C 2008, 'Exception Mining on Multiple Time Series in Stock Market', 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Springer, Sydney, Australia, pp. 690-693.
This paper presents our research on exception mining on multiple time series data, which aims to assist stock market surveillance by identifying market anomalies. Traditional technologies for stock market surveillance have shown their limitations in handling large amounts of complicated stock market data. In our research, Outlier Mining on Multiple time series (OMM) is proposed to improve the effectiveness of exception detection for stock market surveillance. The idea of our research is presented, the challenges are analyzed, and potential research directions are summarized.
Ou, Y, Cao, L, Luo, C & Liu, L 2008, 'Mining Exceptional Activity Patterns in Microstructure Data', 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society, University of Technology, Sydney, Australia, pp. 884-887.
Market surveillance plays an important role in maintaining market integrity, transparency and fairness. Existing trading pattern analysis only focuses on interday data, which discloses explicit and high-level market dynamics. Meanwhile, existing market surveillance systems face challenges from the misuse, mis-disclosure and misdealing of information, announcements and orders in one market or across multiple markets. Therefore, there is a crucial need to develop workable methods for smart surveillance. To deal with such issues, we propose an innovative methodology: microstructure activity pattern analysis. Based on this methodology, a case study in identifying exceptional microstructure activity patterns is carried out. The experiments on real-life stock data show that microstructure activity pattern analysis opens a new and effective means for understanding and analysing market dynamics. The resulting findings, such as exceptional microstructure activity patterns, can greatly enhance the learning, detection, adaptation and decision-making capability of market surveillance.
Ou, Y, Cao, L, Luo, C & Zhang, C 2008, 'Domain-Driven Local Exceptional Pattern Mining for Detecting Stock Price Manipulation', Lecture Notes in Computer Science Vol 5351: PRICAI 2008: Trends in Artificial Intelligence, Pacific Rim International Conference on Artificial Intelligence, Springer, Hanoi, Vietnam, pp. 849-858.
Recently, a new data mining methodology, Domain Driven Data Mining (D3M), has been developed. On top of data-centered pattern mining, D3M targets actionable knowledge discovery under domain-specific circumstances. It strongly encourages the involvement of domain intelligence in the whole process of data mining, and consequently leads to deliverables that can satisfy business user needs and decision-making. Following the D3M methodology, this paper investigates local exceptional patterns in real-life microstructure stock data for detecting stock price manipulation. Different from existing pattern analysis, which is mainly on interday data, we deal with tick-by-tick data. Our approach proposes new mechanisms for constructing microstructure order sequences by involving domain factors and business logic, and for measuring the interestingness of patterns from the perspective of business concerns. Experiments on real-life exchange data demonstrate that the outcomes generated by following D3M can satisfy business expectations and support business users in taking actions for market surveillance.
Cao, L 2007, 'Multi-strategy Integration for Actionable Trading Agents', Workshop on Agents & Data Mining Interaction (ADMI 2007), International Workshop on Agents and Data Mining Interaction, IEEE Computer Soc, San Jose, USA, pp. 487-490.
Trading agents are very useful for developing and back-testing quality trading strategies to support smart trading actions in the market. However, the existing trading agent research mainly focuses on simple and simulated strategies. As a result, there exists a big gap between academia and business when the developed trading agents are deployed in real life, and the actionable capability of developed trading agents is often very limited. In this paper, we introduce approaches for optimizing and integrating multiple classes of strategies for trading agents. Five categories of trading strategies, comprising 36 types, are trained and tested. A strategy integration and optimization approach is proposed to identify the golden trading strategy in each category, and finally to recommend positions associated with these golden strategies to trading agents. Tests on ten years of data in each of five international markets have shown that the final strategies recommended to trading agents can lead to high benefits at low cost. Concurrent execution of positions recommended by all golden strategies can greatly enhance performance.
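The "golden strategy per category" idea can be sketched as a grid search within each strategy family, keeping the best-performing parameter set by back-tested return. The two toy families (moving-average crossover and momentum), the parameter grids and the geometric price series below are all illustrative assumptions, not the paper's 36 strategy types.

```python
import numpy as np

def strategy_returns(position, prices):
    """P&L of holding position[t] (0 = flat, 1 = long) over the move t -> t+1."""
    r = np.diff(prices) / prices[:-1]
    return position[:-1] * r

def ma_crossover(prices, fast, slow):
    """Long while the fast trailing moving average is above the slow one."""
    def ma(w):
        c = np.convolve(prices, np.ones(w) / w, mode="valid")
        return np.concatenate([np.full(w - 1, c[0]), c])
    return (ma(fast) > ma(slow)).astype(float)

def momentum(prices, lookback):
    """Long when the price is above its level `lookback` steps ago."""
    pos = np.zeros_like(prices, dtype=float)
    pos[lookback:] = (prices[lookback:] > prices[:-lookback]).astype(float)
    return pos

def golden(prices, family, grid):
    """The 'golden' parameter set of one strategy family: the grid point with
    the highest total back-tested return."""
    return max((strategy_returns(family(prices, *p), prices).sum(), p) for p in grid)

prices = 100.0 * 1.001 ** np.arange(300)     # toy steadily-trending series
best_mom = golden(prices, momentum, [(5,), (20,), (60,)])
best_ma = golden(prices, ma_crossover, [(5, 20), (10, 50)])
```

On a monotone trend the shortest momentum lookback enters earliest and wins its category; in the paper's setting each category's golden strategy is then executed concurrently.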
Cao, L & Zhang, C 2007, 'F-Trade: An Agent-Mining Symbiont for Financial Services', Agent & Data Mining Interaction, International Conference on Autonomous Agents and Multiagent Systems, IFAAMAS, Honolulu, Hawai'i, pp. 1363-1364.
The interaction and integration of agent technology and data mining presents prominent benefits for solving some of the challenging issues in each area. For instance, data mining can enhance agent learning, while agents can benefit data mining with distributed pattern discovery. In this paper, we summarize the main functionalities and features of an agent service and data mining symbiont, F-Trade. F-Trade is constructed as a Java agent service following the theory of open complex agent systems. We demonstrate the roles of agents in building up F-Trade, as well as how agents can support data mining. On the other hand, data mining is used to strengthen agents. F-Trade provides flexible and efficient services for trading evidence back-testing, optimization and discovery, as well as plug and play of algorithms, data and system modules for financial trading and surveillance, with online connectivity to huge quantities of global market data.
Cao, L, Luo, C & Zhang, C 2007, 'Agent-Mining Interaction: An Emerging Area', Autonomous Intelligent Systems: Multi-Agents and Data Mining, International Workshop Autonomous Intelligent Systems: Agents and Data Mining, Springer, St. Petersburg, Russia, pp. 60-73.
In the past twenty years, agents (we mean autonomous agents and multi-agent systems) and data mining (also knowledge discovery) have emerged separately as two of the most prominent, dynamic and exciting research areas. In recent years, an increasingly remarkable trend in both areas is agent-mining interaction and integration. This is driven not only by researchers' interests, but by intrinsic challenges and requirements from both sides, as well as the benefits and complementarity brought to both communities through agent-mining interaction. In this paper, we draw a high-level overview of agent-mining interaction from the perspective of an emerging area in the scientific family. To promote it as a newly emergent scientific field, we summarize the key driving forces, originality, major research directions and respective topics, and the progression of research groups, publications and activities of agent-mining interaction. Both theoretical and application-oriented aspects are addressed. The above investigation shows that agent-mining interaction is attracting ever-increasing attention from both the agent and data mining communities. Some complicated challenges in either community may be effectively and efficiently tackled through agent-mining interaction. However, as a new open area, there are many issues waiting for research and development from theoretical, technological and practical perspectives. This work is sponsored by Australian Research Council Discovery Grants (DP0773412, LP0775041, DP0667060, DP0449535) and UTS internal grants.
Cao, L, Luo, C & Zhang, C 2007, 'Developing Actionable Trading Strategies for Trading Agents', Proceedings of the IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IEEE Computer Soc, San Jose, pp. 72-75.
Trading agents are very useful for developing and backtesting quality trading strategies for taking actions in the real world. However, existing trading agent research mainly focuses on simulation using artificial data and market models. As a result, the actionable capability of the developed trading strategies is often limited. In this paper, we analyze such constraints on developing actionable trading strategies for trading agents. These considerations are applied in developing a series of trading strategies for trading agents through optimizing and enhancing actionable trading strategies. We demonstrate working case studies on large-scale market data. These approaches and their performance are evaluated from both technical and business perspectives.
Luo, D, Cao, L, Ni, J & Liu, L 2007, 'Building Agent Service Oriented Multi-Agent Systems', Agent and Multi-Agent Systems: Technologies and Applications, International KES Symposium on Agents and Multiagent Systems - Technologies and Applications, Springer, Wroclaw, Poland, pp. 11-20.
An effective agent-based design approach is significant in engineering agent-based systems. Existing design approaches meet with challenges in designing Internet-based open agent systems. The emergence of service-oriented computing (SOC) brings intrinsic mechanisms for complementing agent-based computing (ABC). In this paper, we investigate the dialogue between agent and service, and between ABC and SOC. As a consequence, we synthesize them and develop a design approach called agent service-oriented design (ASOD). ASOD consists of agent service-based architectural design and detailed design. ASOD expands the content and range of agents and ABC, and synthesizes the qualities of SOC, such as interoperability and openness, with the performance of ABC, such as flexibility and autonomy. The above techniques have been deployed in developing F-Trade, an online trading and mining support infrastructure.
Zhao, Y, Zhang, H, Figueiredo, F, Cao, L & Zhang, C 2007, 'Mining for combined association rules on multiple datasets', Proceedings of the 2007 international workshop on Domain driven data mining, International Workshop on Domain Driven Data Mining, ACM, San Jose, USA, pp. 18-23.
Many organisations have their digital information stored in distributed systems, be it in different locations or in vertically and horizontally distributed repositories, which brings a high level of complexity to data mining. From a classical data mining view, where algorithms expect a denormalised structure to operate on, heterogeneous data sources, such as static demographic and dynamic transactional data, have to be manipulated and integrated before commercial association rule algorithms can be applied. Bearing in mind the usefulness and understandability of the application from a business perspective, combined rules of multiple patterns derived from different repositories, containing historical and point-in-time data, were used to produce new association mining techniques applied to debt recovery. Initially, debt repayment patterns were discovered using transactional data and class labels defined by domain expertise; then demographic patterns were attached to each of the class labels. After combining the patterns, two types of rules were discovered, leading to different results: 1) the same demographic pattern with different repayment patterns, and 2) the same repayment pattern with different demographic patterns. The rules produced are interesting, valuable, complete and understandable, which shows the applicability and effectiveness of the new method.
Cao, L, Zhao, Y, Figueiredo, F, Ou, Y & Luo, D 2007, 'Mining High Impact Exceptional Behavior Patterns', Emerging Technologies in Knowledge Discovery and Data Mining: Revised Selected Papers of PAKDD 2007 International Workshops, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Nanjing, China, pp. 56-63.
In the real world, exceptional behavior can be seen in many situations, such as security-oriented fields. Such behavior is rare and dispersed, while some of it may be associated with significant impact on society. A typical example is the September 11 event. The key feature of such rare but significant behavior is its high potential to be linked with significant impact, and identifying it before it generates impact on the world is very important. In this paper, we develop several types of high-impact exceptional behavior patterns. The patterns include frequent behavior patterns which are associated with either positive or negative impact, and frequent behavior patterns that lead to both positive and negative impact. Our experiments in mining debt-associated customer behavior in the social security area show that the above approaches are useful in identifying exceptional behavior, to deeply understand customer behavior and streamline business processes.
Ou, Y, Cao, L, Yu, T & Zhang, C 2007, 'Detecting Turning Points of Trading Price and Return Volatility for Market Surveillance Agents', Workshop on Agents & Data Mining Interaction (ADMI 2007), International Workshop on Agents and Data Mining Interaction, IEEE Computer Soc, San Jose, pp. 491-494.
The trading agent concept is very useful for trading strategy design and market mechanism design. In this paper, we introduce the use of trading agents for market surveillance. Market surveillance agents can be developed for market surveillance officers and management teams to present them with alerts and indicators of abnormal market movements. In particular, we investigate strategies for market surveillance agents to detect the impact of company announcements on market movements. This paper examines the performance of segmentation on the time series of trading price and of return volatility, respectively. The purpose of segmentation is to detect the turning points of market movements caused by announcements, which are useful for identifying indicators of insider trading. The experimental results indicate that segmentation on the time series of return volatility outperforms that on the time series of trading price: it is easier to detect the turning points of return volatility than those of trading price. The results will be used to code market surveillance agents to monitor abnormal market movements before the disclosure of market-sensitive announcements. In this way, market surveillance agents can assist market surveillance officers with indicators and alerts.
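Detecting a turning point by segmenting a volatility series can be illustrated with the simplest possible piecewise-constant segmentation: choose the split that minimises within-segment squared error. This is a minimal sketch on a synthetic regime change, not the segmentation algorithm used in the paper.

```python
import numpy as np

def rolling_vol(prices, window=10):
    """Rolling standard deviation of log-returns."""
    r = np.diff(np.log(prices))
    return np.array([r[i - window:i].std() for i in range(window, len(r) + 1)])

def turning_point(x):
    """Single change point: the split minimising the sum of within-segment
    squared errors (one-step piecewise-constant segmentation)."""
    best_sse, best_t = np.inf, None
    for t in range(2, len(x) - 1):
        sse = (((x[:t] - x[:t].mean()) ** 2).sum()
               + ((x[t:] - x[t:].mean()) ** 2).sum())
        if sse < best_sse:
            best_sse, best_t = sse, t
    return best_t

# Synthetic volatility series: a calm regime followed by a volatile one,
# e.g. around a market-sensitive announcement.
vol = np.r_[np.full(50, 0.1), np.full(50, 0.5)]
tp = turning_point(vol)
```

A surveillance agent would compute `rolling_vol` from observed prices, segment it (recursively, for multiple turning points), and raise an alert when a detected turning point precedes the disclosure of an announcement.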
Cao, L 2006, 'Activity mining: Challenges and prospects', Advanced Data Mining and Applications, Proceedings, Lecture Notes in Artificial Intelligence, International Conference on Advanced Data Mining and Applications, Springer-Verlag Berlin, Xi'an, China, pp. 582-593.
Activity data accumulated in real life, e.g. in terrorist activities and fraudulent customer contacts, presents special structural and semantic complexities. However, it may lead to or be associated with significant business impacts. For instance, a seri
Cao, L & Zhang, C 2006, 'Domain-driven actionable knowledge discovery in the real world', Advances in Knowledge Discovery and Data Mining, Proceedings, Lecture Notes in Artificial Intelligence, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer-Verlag Berlin, Singapore, pp. 821-830.
Actionable knowledge discovery is one of Grand Challenges in KDD. To this end, many methodologies have been developed. However, they either view data mining as an autonomous data-driven trial-and-error process, or only analyze the issues in an isolated a
Cao, L, Luo, C, Ni, J, Luo, D & Zhang, C 2006, 'Stock data mining through fuzzy genetic algorithm', Proceedings of the 9th Joint Conference on Information Sciences, JCIS 2006.
Stock data mining, such as financial pairs mining, is useful for trading support and market surveillance. Financial pairs mining targets pair relationships between financial entities such as stocks and markets. This paper introduces a fuzzy genetic algorithm framework and strategies for discovering pair relationships in stock data, such as high-dimensional trading data, while considering user preferences. The developed techniques have the potential to mine pairs between stocks, between stock-trading rules, and between markets. Experiments on real stock data show that the proposed approach is useful for mining pairs helpful for real trading decision support and market surveillance.
Cao, L, Luo, D & Zhang, C 2006, 'Fuzzy genetic algorithms for pairs mining', PRICAI 2006: Trends in Artificial Intelligence, Proceedings, Lecture Notes in Artificial Intelligence, Pacific Rim International Conference on Artificial Intelligence, Springer-Verlag Berlin, Guilin, China, pp. 711-720.
Pairs mining targets to mine pairs relationship between entities such as between stocks and markets in financial data mining. It has emerged as a kind of promising data mining applications. Due to practical complexities in the real-world pairs mining suc
Cao, L, Ni, J & Luo, D 2006, 'Ontological engineering in data warehousing', Frontiers of WWW Research and Development - APWeb 2006, Proceedings - Lecture Notes in Computer Science, Asia Pacific Web Conference, Springer-Verlag Berlin, Harbin, China, pp. 923-929.
In our previous work, we proposed the ontology-based integration of data warehousing to make existing data warehouse system more user-friendly, adaptive and automatic. This paper further outlines a high-level picture of the ontological engineering in dat
Ni, J, Cao, L & Zhang, C 2006, 'Agent services-oriented architectural design of a framework for artificial stock markets', Advances in Intelligent IT: Active Media Technology 2006, International Conference on Active Media Technology, IOS Press, Brisbane, Australia, pp. 396-399.
Zhang, C & Cao, L 2006, 'Domain-driven mining: Methodologies and applications', Advances in Intelligent IT: Active Media Technology 2006, International Conference on Active Media, IOS Press, Brisbane, Australia, pp. 13-16.
Zhao, Y, Cao, L, Morrow, YK, Ou, Y, Ni, J & Zhang, C 2006, 'Discovering debtor patterns of Centrelink customers', Data mining 2006; Proceedings of AusDM 2006, Australian Data Mining Conference, ACS Inc, Sydney, Australia, pp. 135-144.
Cao, L, Schurmann, R & Zhang, C 2005, 'Domain-Driven In-Depth Pattern Discovery: A Practical Methodology', Proceedings 4th Australasian Data Mining Conference AusDM05, Australian Data Mining Conference, The University of Technology, Sydney, Sydney, Australia, pp. 101-114.
Cao, L, Zhang, C & Ni, J 2005, 'Agent services-oriented architectural design of open complex agent systems', Proceedings of 2005 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IEEE, Compiegne, France, pp. 120-123.
Architectural design is a critical phase in building agent-based systems. However, most existing agent-oriented software engineering approaches deliver weak or incomplete support for the architectural design of distributed and especially Internet-based agent systems. On the other hand, the emergence of service-oriented computing (SOC) brings intrinsic mechanisms for complementing agent-based computing (ABC). In this paper, we investigate the dialogue between ABC and SOC and their integration in architectural design. We synthesize them to develop the computational concept of an agent service, and build a new design approach called agent service-oriented architectural design (ASOAD). ASOAD expands the contents and range of agents and ABC, and synthesizes the qualities of SOC, such as interoperability and openness, with the performance of ABC, such as flexibility and autonomy. It is suitable for designing distributed agent systems and agent service-based enterprise application integration.
Cao, L, Zhang, C, Luo, D, Chen, W & Zamari, N 2004, 'Integrative Early Requirements Analysis for Agent-Based Software', Fourth International Conference on Hybrid Intelligent Systems HIS-2004, International Conference on Hybrid Intelligent Systems, IEEE Computer Society Press, Kitakyushu, Japan, pp. 1-6.
Early requirements analysis (ERA) is significant for building agent-based systems. Goal-oriented requirements analysis is promising for agent-oriented early requirements analysis. In general, either visual modeling or formal specifications is u
Lin, L, Cao, L & Zhang, C 2005, 'Genetic algorithms for robust optimization in financial applications', Proceedings of the IASTED International Conference on Computational Intelligence, IASTED International Conference on Computational Intelligence, ACTA Press, Calgary, Canada, pp. 387-391.
In stock markets and other financial market systems, technical trading rules are widely used to generate buy and sell alert signals. Each rule has many parameters. Users often want to get the best signal series from the in-sample sets, (H
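The abstract above is cut off, but it describes using genetic algorithms to optimize the parameters of technical trading rules. A minimal, generic GA sketch of that idea follows; the toy fitness function, the parameter range and all names are our own illustration, not the paper's implementation (a real system would back-test each candidate rule on price data).

```python
import random

# Minimal genetic-algorithm sketch for tuning a trading-rule parameter
# (e.g. a moving-average window). The fitness function is a toy stand-in
# with a known optimum at window = 20; everything here is illustrative.

def fitness(window):
    # Toy objective: best "signal quality" at window = 20.
    return -(window - 20) ** 2

def evolve(pop_size=20, generations=40, lo=2, hi=60, seed=0):
    rng = random.Random(seed)
    pop = [rng.randint(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # truncation selection (elitist)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = (a + b) // 2                  # arithmetic crossover
            if rng.random() < 0.2:                # occasional mutation
                child += rng.randint(-3, 3)
            children.append(min(hi, max(lo, child)))
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

Because the top half of each generation is carried over unchanged, the best parameter found never degrades across generations.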
Lin, L, Cao, L & Zhang, C 2005, 'The Fish-Eye Visualization of Foreign Currency Exchange Data Streams', Asia-Pacific Symposium on Information Visualisation 2005, Asia-Pacific Symposium on Information Visualisation, ACS, Sydney, Australia, pp. 91-96.
In a foreign currency exchange market, there are high-density data streams. Existing approaches to visualizing this type of data cannot present both targeted local detail and global trend information in one view. In this paper, based on the features and attributes of foreign currency exchange trading streams, we discuss and compare multiple approaches, including interactive zooming, multiform sampling combined with attributes of large foreign currency exchange data, and fish-eye view embedded visualization, for the visual display of high-density foreign currency exchange transactions. By comparison, fish-eye-based visualization is the best option: it can display regional records in detail without losing the global movement trend of the market in a limited display window. We used fish-eye technology for the output visualization of foreign currency exchange trading strategies in our trading support system, which links to real-time foreign currency market closing data.
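As an illustration of the focus+context idea the abstract describes, here is a minimal sketch of a one-dimensional fisheye transform in the style of Furnas' graphical fisheye. The function name, the `focus` point and the distortion parameter `d` are our own assumptions, not the paper's implementation.

```python
# Illustrative sketch of a graphical fisheye transform for focus+context
# views of dense data streams: detail near the focus is magnified while
# the overall extent of the axis is preserved.

def fisheye(x, focus=0.0, d=3.0):
    """Map a coordinate x in [0, 1] so that the region near `focus` is
    magnified. d >= 0 controls distortion strength (d = 0 is the identity)."""
    span = max(focus, 1.0 - focus)
    t = (x - focus) / span                      # signed position in [-1, 1]
    g = (d + 1) * abs(t) / (d * abs(t) + 1)     # Furnas-style magnification
    return focus + (g if t >= 0 else -g) * span

# Evenly spaced ticks spread apart near the focus, compress far from it.
ticks = [i / 10 for i in range(11)]
warped = [round(fisheye(x, focus=0.5), 3) for x in ticks]
```

The endpoints 0 and 1 map to themselves, so the global extent of the stream stays visible while the neighbourhood of the focus gets more screen space.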
Lin, L, Cao, L & Zhang, C 2005, 'The Visualization of Large Database in Stock Markets', Proceedings of the IASTED International Conference on Databases and Applications, IASTED International Multi Conference, ACTA Press, Innsbruck, Austria, pp. 163-166.
Cao, L, Luo, C, Luo, D & Liu, L 2004, 'Ontology services-based information integration in mining telecom business intelligence', PRICAI 2004: Trends in Artificial Intelligence, Proceedings, Pacific Rim International Conference on Artificial Intelligence, Springer-Verlag Berlin, Auckland, New Zealand, pp. 85-94.
Cao, L, Luo, C, Luo, D & Zhang, C 2004, 'Hybrid Strategy of Analysis and Control of Telecommunications Frauds', Proceedings of 2nd International Conference on Information Technology and Applications, International Conference on Information Technology and Applications, IEEE, Harbin, China, pp. 11-15.
The problem of telecommunications fraud has been growing more serious for many years, and is worsening not only in western countries but also in some developing countries. Detection, analysis and prevention mechanisms are emerging from both telecommunications operators and academia. In this paper, we present a hybrid strategy for the analysis and control of telecommunications fraud from an engineering viewpoint. Our first task is to identify the complexity of telecommunications fraud; we discuss possible fraud scenarios and their evolution. Furthermore, in order to build an information system that deals with realistic telecommunications fraud, we summarize and propose a hybrid strategy, which includes a solution package, five models and four types of analyses, to construct a closed-loop system for the analysis and control of fraud. We further discuss a system framework for the analysis and control of telecommunications fraud.
Cao, L, Luo, C, Luo, D & Zhang, C 2004, 'Integration of Business Intelligence Based on Three-Level Ontology Services', IEEE/WIC/ACM International Conference on Web Intelligence (WI2004), IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE, Beijing, China, pp. 17-23.
Usually, business intelligence (BI) in a realistic telecom enterprise is integrated by packaging data warehouse (DW), OLAP, data mining and reporting tools from different vendors together. As a result, BI system users are confined to a reporting system whose reports, data models, dimensions and measures are predefined by system designers. Surveys show that 85% of DW projects fail to meet their intended objectives. In this paper, we investigate how to integrate BI packages into an adaptive and flexible knowledge portal by constructing an internal link and communication channel from top-level business concepts to the underlying enterprise information systems (EIS). We develop an approach of three-level ontology services, which implements unified naming, directory and transport of ontology services, together with ontology mapping and query parsing among the conceptual, analytical and physical views, from user interfaces through the DW to the EIS. Experiments on top of a real telecom EIS show that our solution for integrating BI supports operational decision-making in a much more user-friendly and adaptive way than simply combining currently available BI products.
Cao, L, Luo, D, Luo, C & Liu, L 2004, 'Ontology Transformation in Multiple Domains', AI 2004: Advances in Artificial Intelligence, 17th Australian Joint Conference on Artificial Intelligence Cairns, Australia, December 2004 Proceedings, Australasian Joint Conference on Artificial Intelligence, Springer, Cairns, Australia, pp. 985-990.
We have proposed a new approach called ontology services-driven integration of business intelligence (BI) to designing an integrated BI platform. In such a BI platform, multiple ontological domains may get involved, such as domains for business, reporting, data warehouse, and multiple underlying enterprise information systems. In general, ontologies in the above multiple domains are heterogeneous. So, a key issue emerges in the process of building an integrated BI platform, that is, how to support ontology transformation and mapping between multiple ontological domains. In this paper, we present semantic aggregations of semantic relationships and ontologies in one or multiple domains, and the ontological transformation from one domain to another. Rules for the above semantic aggregation and transformation are described. This work is the foundation for supporting BI analyses crossing multiple domains.
Cao, L, Ni, J, Wang, J & Zhang, C 2004, 'Agent Services-Driven-plug-and-play in F-Trade', AI 2004: Advances in Artificial Intelligence, 17th Australian Joint Conference on Artificial Intelligence Cairns, Australia, December 2004 Proceedings, Australasian Joint Conference on Artificial Intelligence, Springer, Cairns, Australia, pp. 917-922.
We have built an agent service-based enterprise infrastructure: F-TRADE. With its online connectivity to massive real stock data in global markets, it can be used for the online evaluation of trading strategies and data mining algorithms. The main functions of F-TRADE include soft plug-and-play, and the back-testing, optimization, integration and evaluation of algorithms. In this paper, we focus on the intelligent plug-and-play, a key system function of F-TRADE. The basic idea of soft plug-and-play is to build agent services that support the online plug-in of agents, algorithms and data sources. Agent-UML-based modeling, the role model and agent services for plug-and-play are discussed. With this design, algorithm providers, data source providers and system module developers can expand F-TRADE's functions and resources by plugging them in online.
Cao, L, Wang, J, Lin, L & Zhang, C 2004, 'Agent Services-Based Infrastructure for Online Assessment of Trading Strategies', Proceedings IEEE/WIC/ACM International Conference on Intelligent Agent Technology (IAT 2004), IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IEEE, Beijing, China, pp. 345-348.
Traders and researchers in stock markets often hold private trading strategies. Evaluating and optimizing these strategies before taking any risk in real trading is of great benefit to them. We have built an agent services-driven infrastructure, F-TRADE, which supports the online plug-in, iterative back-testing and recommendation of trading strategies. We propose an agent services-driven approach for building this automated enterprise infrastructure. The description, directory and mediation of agent services are discussed, as is the system structure of the agent services-based F-TRADE. F-TRADE has become an online test platform for research on and application of multi-agent technology and data mining in stock markets.
Lin, L, Cao, L, Wang, J & Zhang, C 2004, 'The Applications of Genetic Algorithms in Stock Market Data Mining Optimisation', Data Mining V, Data Mining, Text Mining and Their Business Application, Conference on Data Mining, Text Mining and Their Business Application, Wessex Institute of Technology Press, Malaga, Spain, pp. 273-280.
Cao, L, Li, C, Zhang, C & Dai, RW 2003, 'Open Giant Intelligent Information Systems and Its Multiagent-Oriented System Design', Proceedings of the International Conference on Software Engineering Research and Practice Volume II, International Conference on Software Engineering Research and Practice, CSREA Press, Las Vegas, Nevada, USA, pp. 816-822.
Cao, L, Luo, C, Li, C, Zhang, C & Dai, RW 2003, 'Open Giant Intelligent Information Systems and Its Agent-Oriented Abstraction Mechanism', Proceedings of the Fifteenth International Conference on Software Engineering and Knowledge Engineering, International Conference on Software Engineering and Knowledge Engineering, Knowledge Systems Institute, San Francisco, California, USA, pp. 85-89.
Cao, L, Luo, D, Luo, C & Zhang, C 2003, 'Systematic Engineering in Designing Architecture of Telecommunications Business Intelligence System', Design and Application of Hybrid Intelligent Systems, HIS03, the Third International Conference on Hybrid Intelligent Systems, International Conference on Hybrid Intelligent Systems, IOS Press, Melbourne, Australia, pp. 1084-1093.
Li, C, Zhang, C & Cao, L 2003, 'Theoretical Evaluation of Ring-Based Architectural Model for Middle Agents in Agent-Based System', Foundations of Intelligent Systems. 14th Symposium, ISMIS 2003 Proceedings, International Symposium on Foundations of Intelligent Systems, Springer-Verlag Berlin Heidelberg, Maebashi City, Japan, pp. 603-607.
The ring-based architectural model is often employed to promote the scalability and robustness of agent-based systems. However, there are no criteria for evaluating the performance of this model. In this paper, we introduce an evaluation approach for comparing the performance of the ring-based architectural model with others. To evaluate it, we propose an application-based information-gathering system with middle agents, which are organized in a ring-based architecture and solve the matching problem between service-provider agents and requester agents. We evaluate the ring-based architectural model on performance predictability, adaptability and availability, and demonstrate its potential through the evaluation results.
Li, F, Xu, G & Cao, L 2015, 'CSAL: Self-adaptive Labeling based Clustering Integrating Supervised Learning on Unlabeled Data'.
Supervised classification approaches can predict labels for unknown data thanks to the supervised training process, but their success depends heavily on labeled training data. In contrast, clustering is effective in revealing the aggregation properties of unlabeled data, yet the performance of most clustering methods is limited by the absence of labels. In real applications, however, it is time-consuming and sometimes impossible to obtain labeled data. Combining clustering and classification is a promising and active approach that can largely improve performance. In this paper, we propose an innovative and effective clustering framework based on self-adaptive labeling (CSAL) which integrates clustering and classification on unlabeled data. Clustering is first employed to partition the data, and a certain proportion of the clustered data is selected by our proposed labeling approach for training classifiers. To refine the trained classifiers, an iterative Expectation-Maximization process is built into the CSAL framework. Experiments are conducted on public data sets to test different combinations of clustering algorithms and classification models as well as various training data labeling methods. The experimental results show that our approach with the self-adaptive labeling method outperforms the others.
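To make the framework concrete, here is a schematic, pure-Python sketch of the CSAL idea as described in the abstract: cluster the unlabeled data, pseudo-label only the points assigned most confidently (closest to a centroid), refit EM-style, and classify everything at the end. The 1-D k-means, the nearest-mean "classifier" and all names are our simplifications, not the paper's algorithms.

```python
import random
from statistics import mean

# Schematic sketch of clustering with self-adaptive labeling (CSAL-style):
# cluster, pseudo-label the most confident points, refine iteratively.

def kmeans_1d(xs, k=2, iters=20, seed=1):
    rng = random.Random(seed)
    centroids = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda j: abs(x - centroids[j]))].append(x)
        centroids = [mean(g) if g else centroids[j] for j, g in enumerate(groups)]
    return centroids

def csal(xs, k=2, label_fraction=0.3, em_rounds=5):
    centroids = kmeans_1d(xs, k)
    for _ in range(em_rounds):
        # E-step: pseudo-label only the fraction of points nearest a centroid.
        scored = sorted(xs, key=lambda x: min(abs(x - c) for c in centroids))
        confident = scored[: int(label_fraction * len(xs))]
        # M-step: refit the class means (our stand-in classifier) on them.
        groups = [[] for _ in range(k)]
        for x in confident:
            groups[min(range(k), key=lambda j: abs(x - centroids[j]))].append(x)
        centroids = [mean(g) if g else centroids[j] for j, g in enumerate(groups)]
    # Final classifier: assign every point to its nearest class mean.
    return [min(range(k), key=lambda j: abs(x - centroids[j])) for x in xs]

data = [0.1, 0.2, 0.15, 0.3, 5.0, 5.2, 4.9, 5.1]
labels = csal(data)
```

On these two well-separated groups, the sketch recovers a consistent label per group; the `label_fraction` knob is the "certain proportion of clustered data" the abstract mentions.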
Li, F, Zhao, Y, Felsche, K, Xu, G & Cao, L 2015, 'Coupling Analysis Between Twitter and Call Centre'.
Social media has been contributing to many research areas, such as data mining, recommender systems and time series analysis. However, there are not many successful applications of social media in government agencies. In fact, many governments have social media accounts on platforms such as Twitter and Facebook, and more and more customers communicate with governments through social media, generating massive external social media data for governments. This external data can be beneficial for analysing the behaviours and real needs of customers. In addition, most governments also run a call centre to help customers solve their problems. It is not difficult to imagine that enquiries on external social media and at the internal call centre may have coupling relationships. These couplings could be helpful for studying customers' intent and for allocating a government's limited resources for better service. In this paper, we focus on analysing the coupling relations between the internal call centre and external public media using time series analysis methods, for the Australian Department of Immigration and Border Protection. The discovered couplings demonstrate that the call centre and public media are indeed correlated, which is significant for understanding customers' behaviours.
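One simple form of the time-series coupling analysis the abstract describes can be sketched, under our own simplifying assumptions, by correlating one daily series against lagged copies of the other: the lag with the strongest Pearson correlation hints at a lead-lag coupling. The toy series and every name below are illustrative, not the study's data or code.

```python
from statistics import mean, pstdev

# Hedged sketch of lagged coupling analysis between two daily volume
# series (e.g. tweet volume vs. call-centre call volume).

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

def lagged_coupling(social, calls, max_lag=3):
    """Correlate social[t] with calls[t + lag]; the best lag suggests
    how long social-media chatter leads call-centre volume."""
    scores = {}
    for lag in range(max_lag + 1):
        n = len(social) - lag
        scores[lag] = pearson(social[:n], calls[lag:lag + n])
    return max(scores, key=scores.get), scores

# Toy series in which call volume echoes social volume two days later.
social = [3, 8, 2, 9, 4, 7, 1, 6, 5, 8]
calls = [0, 0] + social[:-2]
lag, scores = lagged_coupling(social, calls)
```

With the toy series, the strongest correlation appears at a two-day lag, recovering the planted lead-lag relationship; on real data one would also test the significance of each correlation.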
Leading one of the very few data science teams in the world that effectively integrate quality original research with high-impact practice, Longbing has been working with over 20 tier-one industry and government organizations, leading and managing large innovation-driven consultancy and contract research projects. These include Australian Commonwealth Government departments and agencies such as the Department of Human Services (Centrelink), the Department of Immigration and Border Protection, the Australian Taxation Office, IP Australia, the Department of Finance and Services and the NSW Office of State Revenue, as well as Westpac Banking Corporation, Commonwealth Bank of Australia, National Australia Bank, the Hospitals Contribution Fund (HCF), AMP, Insurance Australia Group, Microsoft, SAS, Teradata and the Capital Markets Cooperative Research Centre.
Their work has delivered billions of dollars in savings for the organizations involved and has had a strong impact in addressing challenging governmental and business problems, as recognized in government reports, media articles and programs, and an OECD report.