
Associate Professor Jinyan Li

Biography

Dr. Jinyan Li is the Bioinformatics Program Leader and Associate Professor at the Advanced Analytics Institute and Centre for Health Technologies, Faculty of Engineering and IT, UTS. Jinyan has a Bachelor of Science (Applied Mathematics) from the National University of Defense Technology (China), a Master of Engineering (Computer Engineering) from Hebei University of Technology (China), and a PhD (Computer Science) from the University of Melbourne (Australia). He joined UTS in March 2011 after ten years of research and teaching work in Singapore (Institute for Infocomm Research, Nanyang Technological University, and National University of Singapore).

Jinyan loves research on protein binding free energy prediction, conformational B-cell epitope prediction, PPIs, disease-RNA-gene tripartite, NGS data management, and RNA-seq data analysis. He also loves research on data mining algorithms and new machine learning methods. He has published 90 journal articles and 80 conference papers, many of which are highly cited. The journals he has published in include Machine Learning, Artificial Intelligence, Data Mining and Knowledge Discovery, IEEE TKDE, Bioinformatics, Nucleic Acids Research and Cancer Cell. The conferences include KDD, ICML, PODS, ICDT, ICDE, ICDM and SDM. Jinyan has 4 patents.

Jinyan is widely known for his pioneering and theoretical research work on emerging patterns that has spawned numerous follow-up research interests in data mining, machine learning, and bioinformatics and made an enduring contribution to these fields.

Professional

Associate Editor, BMC Bioinformatics

Academic Editor, PLoS ONE

PC Co-chair, ADMA 2016

PC Co-chair, ICIC 2016

Workshop Co-chair, PAKDD 2016

Associate Professor, Faculty of Engineering & Information Technology
Core Member, CHT - Centre for Health Technologies
Core Member, AAI - Advanced Analytics Institute
Bachelor of Science, Doctor of Philosophy
 
Phone
+61 2 9514 9264

Research Interests

Bioinformatics and computational biology, immunoinformatics, data mining, graph theory, machine learning, and information theory.

Can supervise: Yes

Introduction to bioinformatics, data mining and advanced data analysis.

Books

Li, J. & Wong, L. 2003, Using rules to analyse bio-medical data: A comparison between C4.5 and PCL.
For easy comprehensibility, rules are preferable to non-linear kernel functions in the analysis of bio-medical data. In this paper, we describe two rule induction approaches - C4.5 and our PCL classifier - for discovering rules from both traditional clinical data and recent gene expression or proteomic profiling data. C4.5 is a widely used method, but it has two weaknesses, the single coverage constraint and the fragmentation problem, that affect its accuracy. PCL is a new rule-based classifier that overcomes these two weaknesses of decision trees by using many significant rules. We present a thorough comparison to show that our PCL method is much more accurate than C4.5, and it is also superior to Bagging and Boosting in general. © Springer-Verlag Berlin Heidelberg 2003.

Chapters

Li, J. & Wong, L. 2013, 'Emerging Pattern-Based Rules Characterizing Subtypes of Leukemia' in Dong, G. & Bailey, J. (eds), Contrast Data Mining: Concepts, Algorithms, and Applications, CRC Press, USA, pp. 219-232.
Li, J. & Liu, Q. 2013, 'Protein Binding Interfaces and Their Binding Hot Spot Prediction: A Survey' in Bioinformatics for Diagnosis, Prognosis and Treatment of Complex Diseases, Springer, Germany, pp. 79-106.
In living organisms, genes are the blueprints or library, specifying instructions for building proteins. Proteins constitute the bulk of cells. Proteins' mutual binding and interactions play a vital role in numerous functions and activities, such as signal transduction, enzymatic reactions, immunoreactions and inter-cellular communications. This survey provides basic knowledge of proteins and protein binding. First, we describe proteins' fundamental elements, structures and functions. In Sect. 5.2, we present concepts related to protein binding and interactions. In Sect. 5.3, we explain why protein binding interfaces have an uneven distribution of binding free energy. In Sects. 5.4 and 5.5, we explain why protein interfaces are complicated and how current studies deal with this difficult problem. In Sect. 5.6, we present an overview of methods to model and predict the binding free energy of protein interactions. Section 5.7 concludes this survey with a summary.
Feng, M., Li, J., Dong, G. & Wong, L. 2009, 'Maintenance of frequent patterns: A survey' in Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, pp. 273-293.
This chapter surveys the maintenance of frequent patterns in transaction datasets. It is written to be accessible to researchers familiar with the field of frequent pattern mining. The frequent pattern maintenance problem is summarized with a study on how the space of frequent patterns evolves in response to data updates. This chapter focuses on incremental and decremental maintenance. Four major types of maintenance algorithms are studied: Apriori-based, partition-based, prefix-tree-based, and concise-representation-based algorithms. The authors study the advantages and limitations of these algorithms from both the theoretical and experimental perspectives. Possible solutions to certain limitations are also proposed. In addition, some potential research opportunities and emerging trends in frequent pattern maintenance are also discussed. © 2009, IGI Global.
Dong, G., Li, J., Liu, G. & Wong, L. 2009, 'Mining conditional contrast patterns' in Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, pp. 294-310.
This chapter considers the problem of "conditional contrast pattern mining." It is related to contrast mining, where one considers the mining of patterns/models that contrast two or more datasets, classes, conditions, time periods, and so forth. Roughly speaking, conditional contrasts capture situations where a small change in patterns is associated with a big change in the matching data of the patterns. More precisely, a conditional contrast is a triple (B, F1, F2) of three patterns; B is the condition/context pattern of the conditional contrast, and F1 and F2 are the contrasting factors of the conditional contrast. Such a conditional contrast is of interest if the difference between F1 and F2 as itemsets is relatively small, and the difference between the corresponding matching dataset of BF1 and that of BF2 is relatively large. It offers insights on "discriminating" patterns for a given condition B. Conditional contrast mining is related to frequent pattern mining and analysis in general, and to the mining and analysis of closed patterns and minimal generators in particular. It can also be viewed as a new direction for the analysis (and mining) of frequent patterns. After formalizing the concepts of conditional contrast, the chapter will provide some theoretical results on conditional contrast mining. These results (i) relate conditional contrasts with closed patterns and their minimal generators, (ii) provide a concise representation for conditional contrasts, and (iii) establish a so-called dominance-beam property. An efficient algorithm will be proposed based on these results, and experimental results will be reported. Related works will also be discussed. © 2009, IGI Global.

Conferences

Ghosh, S., Feng, M., Nguyen, H. & Li, J. 2014, 'Predicting Heart Beats using Co-occurring Constrained Sequential Patterns', http://www.cinc.org/archives/2014/, Computing in Cardiology, IEEE, Boston USA, pp. 265-268.
The aim of this study is to develop and evaluate a robust method for heart beat detection using a sequential pattern mining framework, based on the multi-modal Physionet 2014 challenge dataset. Each multi-modal patient time series was initially transformed to a symbolic sequence using Symbolic Aggregate approXimation (SAX). A training set was created by randomly selecting 70% of the data, and the remaining 30% was used as the test set. Later, all segments of length 100 were extracted for annotated beat occurrences. Subsequently, an algorithm was used to extract repetitive frequent subsequences, where consecutive symbols are separated by a pre-defined gap range. The patterns for ECG and BP were then ranked based on length and frequency support. For tests, the highest ranked patterns were used to mark beat segments. True beat occurrences were only considered when patterns co-occurred for both ECG and BP within a width of 150 time points. Our results comprise two parts, viz. extracted top ranked sequences and gross test statistics. An interpretive highest ranked sequential pattern for ECG looks like [7,7,7,5,5,5,5,5,4,3,10,10,10,2,2,3,3,4,3,4,5,5,5,6,7], for 10 discrete symbols which identify regional signal activity, with a gap range of [2,4] between contiguous elements. As per our test results, the method gives us a sensitivity of 51.66% and a positive predictivity (PPV) of 67.15%. The novelty of mining gap constrained co-occurring frequent sequential patterns lies in its ability to capture approximate co-occurring long clinical episodes across multiple variables, even if the quality of one signal suffers for a certain period of time. A higher PPV indicates that our method did not have a lot of false positives (detecting non-beats). The method is still being improved and will be further tested in the next stages of the Ph
Ghosh, S., Feng, M., Nguyen, H. & Li, J. 2014, 'Risk Prediction for Acute Hypotensive Patients by using Gap Constrained Sequential Contrast Patterns', http://knowledge.amia.org/56638-amia-1.1540970/t-004-1.1544972?qr=1, American Medical Informatics Association (AMIA) 2014 Annual Symposium, AMIA, Washington D.C., USA.
Wei, W., Yin, J., Li, J. & Cao, L. 2014, 'Modelling Asymmetry and Tail Dependence among Multiple Variables by Using Partial Regular Vine', Proceedings of the 2014 SIAM International Conference on Data Mining, 2014 SIAM International Conference on Data Mining, SIAM, Philadelphia, USA, pp. 776-784.
Modeling high-dimensional dependence is widely studied to explore deep relations among multiple variables, and is particularly useful for financial risk assessment. Very often, strong restrictions are applied on a dependence structure by existing high-dimensional dependence models. These restrictions prevent the detection of sophisticated structures such as asymmetry and upper and lower tail dependence between multiple variables. The paper proposes a partial regular vine copula model to relax these restrictions. The new model employs partial correlation to construct the regular vine structure, which is algebraically independent. This model is also able to capture the asymmetric characteristics among multiple variables by using two-parametric copulas with flexible lower and upper tail dependence. Our method is tested on a cross-country stock market data set to analyse the asymmetry and tail dependence. The high prediction performance is examined by the Value at Risk, which is a commonly adopted evaluation measure in financial markets. Read More: http://epubs.siam.org/doi/abs/10.1137/1.9781611973440.89
Wei, W., Li, J., Cao, L., Sun, J., Liu, C. & Li, M. 2013, 'Optimal Allocation of High Dimensional Assets through Canonical Vines', Advances in Knowledge Discovery and Data Mining: 17th Pacific-Asia Conference, PAKDD 2013, Gold Coast, Australia, April 14-17, 2013, Proceedings, Part I, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Gold Coast, Australia, pp. 366-377.
Canonical Vine, Mean Variance Criterion, Financial Return.
Wei, W., Fan, X., Li, J. & Cao, L. 2012, 'Model the Complex Dependence Structures of Financial Variables by Using Canonical Vine', The 21st ACM International Conference on Information and Knowledge Management, The 21st ACM International Conference on Information and Knowledge Management (CIKM2012), Springer, Maui, Hawaii, USA, pp. 1382-1391.
Financial variables such as asset returns in the massive market contain various hierarchical and horizontal relationships forming complicated dependence structures. Modeling and mining of these structures is challenging due to their own high structural complexities as well as the stylized facts of the market data. This paper introduces a new canonical vine dependence model to identify the asymmetric and non-linear dependence structures of asset returns without any prior independence assumptions. To simplify the model while maintaining its merit, a partial correlation based method is proposed to optimize the canonical vine. Compared with the original canonical vine, the new model can still maintain the most important dependence but many unimportant nodes are removed to simplify the canonical vine structure. Our model is applied to construct and analyze dependence structures of European stocks as case studies. Its performance is evaluated by measuring the Value at Risk of a portfolio, a widely used risk management measure. In comparison to a very recent canonical vine model and the 'full' model, our experimental results demonstrate that our model has a much better quality of Value at Risk, providing insightful knowledge for investors to control and reduce the aggregation risk of the portfolio.
Li, J., Liu, Q. & Zeng, T. 2010, 'Negative correlations in collaboration: concepts and algorithms', Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, Washington DC, pp. 463-472.
Zhu, L. & Li, J. 2010, 'Water bioinformatics: An association between estrogen degradation and 16S rRNA motifs', 2010 4th International Conference on Bioinformatics and Biomedical Engineering, iCBBE 2010.
The existence of estrogenic compounds in the water severely pollutes the ecological environment. It is believed that microorganisms such as harmless bacteria can be used as a clean and safe medium to naturally degrade the estrogens. Many bacteria have been found to be capable of degrading estrogens in different ways and at different speeds. However, the degradation mechanism, in particular the association between the degradation capability and their phylogenetic motifs, is still unknown. In this paper, we analyzed the 16S rRNA gene sequences of 17 kinds of bacteria, which are usually used for phylogenetic studies. We examined the association between motifs and degradation by distinguishing such motifs that could separate those bacteria into several similar functional groups. Our computational result shows that the motifs have various positive associations with the degradation, implying that different biodegradation factors are in play. © 2010 IEEE.
Tang, M.J., Wang, W., Jiang, Y., Zhou, Y., Li, J., Cui, P., Liu, Y. & Yan, B. 2010, 'Birds bring flues? Mining frequent and high weighted cliques from birds migration networks', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 359-369.
Recent advances in satellite tracking technologies can provide a huge amount of data for biologists to understand continuous long movement patterns of wild bird species. In particular, highly correlated habitat areas are of great biological interest. Biologists can use this information to seek potential ways of controlling highly pathogenic avian influenza. We convert these biological problems into graph mining problems. Traditional models for frequent graph mining assign each vertex label an equal weight. However, the weight difference between vertices can have a strong impact on decision making by biologists. In this paper, by considering different weights of individual vertices in the graph, we develop a new algorithm, Helen, which focuses on identifying cliques with high weights. We introduce a "graph-weighted support framework" to reduce clique candidates, and then filter out the low weighted cliques. We evaluate our algorithm on real life birds' migration data sets, and show that graph mining can be very helpful for ecologists to discover unanticipated bird migration relationships. © Springer-Verlag Berlin Heidelberg 2010.
Chen, P. & Li, J. 2009, 'Prediction of protein long-range contacts using GaMC approach with sequence profile centers', Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009, pp. 128-135.
In this paper, we apply an evolutionary optimization classifier, referred to as genetic algorithm-based multiple classifier (GaMC), to long-range contact prediction. As a result, about 44.1% of contacts between long-range residues (with a sequence separation of at least 24 amino acids) are found around the sequence profile (SP) center when evaluating the top L/5 (L is the sequence length of the protein) classified contacts if the SP centers are known. Meanwhile, with the knowledge of the sequence profile center and the GaMC method, about 20.42% of long-range contacts are correctly predicted. The results show that the SP center may be a sound pathway to predicting contact maps in protein structures. ©2009 IEEE.
Zhao, L. & Li, J. 2009, 'Sequence-based B-cell epitope prediction by using associations in antibody-antigen structural complexes', Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009, pp. 165-172.
B-cell secreted antibodies play a critical role in fighting against invaders and abnormal self tissues. Identifying the epitope on antigens recognized by the paratope on antibodies can improve the understanding of this important immune mechanism. Predicting B-cell epitopes can also pave the way for vaccine design and disease therapy. However, due to the high complexity of this problem, previous prediction methods that focus on linear and conformational epitopes are both unsatisfactory. In this work, we propose a novel method to predict B-cell epitopes, when a pair of sequences is given, by using associations and cooperativity patterns from a relatively small antigen-antibody structural data set. More exactly, our classifier is trained on only PDB protein complexes, but it can be applied to any sequence data. Our evaluation results show that the accuracy of our method is very competitive with, and sometimes even much better than, previous structure-based prediction methods, which have a smaller applicability scope than ours. ©2009 IEEE.
Liu, Q., Chen, Y.P.P. & Li, J. 2009, 'High functional coherence in k-partite protein cliques of protein interaction networks', 2009 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2009, pp. 111-117.
We introduce a new topological concept called k-partite protein cliques to study protein interaction (PPI) networks. In particular, we examine functional coherence of proteins in k-partite protein cliques. A k-partite protein clique is a k-partite maximal clique comprising two or more nonoverlapping protein subsets between any two of which full interactions are exhibited. In the detection of PPI's k-partite maximal cliques, we propose to transform PPI networks into induced K-partite graphs with proteins as vertices where edges only exist among the graph's partites. Then, we present a k-partite maximal clique mining (MaCMik) algorithm to enumerate k-partite maximal cliques from K-partite graphs. Our MaCMik algorithm is applied to a yeast PPI network. We observe that there does exist interesting and unusually high functional coherence in k-partite protein cliques - most proteins in k-partite protein cliques, especially those in the same partites, share the same functions. Therefore, the idea of k-partite protein cliques suggests a novel approach to characterizing PPI networks, and may help function prediction for unknown proteins. © 2009 IEEE.
Tang, M., Zhou, Y., Cui, P., Wang, W., Li, J., Zhang, H., Hou, Y. & Yan, B. 2009, 'Discovery of migration habitats and routes of wild bird species by clustering and association analysis', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 288-301.
Knowledge about the wetland use of migratory bird species during the annual life cycle is very interesting to biologists, as it is critically important for conservation site construction and avian influenza control. The raw data of the habitat areas and the migration routes can be determined by high-tech GPS satellite telemetry, and these data are usually large in scale and highly complex. In this paper, we convert these biological problems into computational studies, and introduce efficient algorithms for the data analysis. Our key idea is the concept of hierarchical clustering for migration habitat localization, and the notion of association rules for the discovery of migration routes. One of our clustering results is the Spatial-Tree, an illustrative map which depicts the home range of bar-headed geese. A related result to this observation is an association pattern that reveals a high possibility of bar-headed geese's potential migration routes. Both of them are of biological novelty and meaning. © 2009 Springer.
Lo, D., Khoo, S. & Li, J. 2008, 'Mining and Ranking Generators of Sequential Patterns', Proceedings of the 2008 SIAM International Conference on Data Mining, SIAM, Atlanta, pp. 553-564.
Li, J., Sim, K., Liu, G. & Wong, L. 2008, 'Maximal Quasi-Bicliques with Balanced Noise Tolerance: Concepts and Co-clustering Applications', Proceedings of the 8th SIAM International Conference on Data Mining (SDM08), SIAM, Atlanta, pp. 72-83.
Liu, X., Li, J. & Wang, L. 2008, 'Quasi-bicliques: Complexity and Binding Pairs', Proceedings of the 14th Annual International Conference, COCOON 2008, Annual International Computing and Combinatorics Conference, Springer, Dalian, pp. 255-264.
Protein-protein interactions (PPIs) are one of the most important mechanisms in cellular processes. To model protein interaction sites, recent studies have suggested to find interacting protein group pairs from large PPI networks at the first step, and then to search conserved motifs within the protein groups to form interacting motif pairs. To consider noise effect and incompleteness of biological data, we propose to use quasi-bicliques for finding interacting protein group pairs. We investigate two new problems which arise from finding interacting protein group pairs: the maximum vertex quasi-biclique problem and the maximum balanced quasi-biclique problem. We prove that both problems are NP-hard. This is a surprising result as the widely known maximum vertex biclique problem is polynomial time solvable [16]. We then propose a heuristic algorithm which uses the greedy method to find the quasi-bicliques from PPI networks. Our experiment results on real data show that this algorithm has a better performance than a benchmark algorithm for identifying highly matched BLOCKS and PRINTS motifs.
Lukman, S., Sim, K., Li, J. & Chen, Y.P.P. 2008, 'Interacting amino acid preferences of 3D pattern pairs at the binding sites of transient and obligate protein complexes', Series on Advances in Bioinformatics and Computational Biology, pp. 69-78.
To assess the physico-chemical characteristics of protein-protein interactions, protein sequences and overall structural folds have been analyzed previously. To highlight this, discovery and examination of amino acid patterns at the binding sites defined by structural proximity in 3-dimensional (3D) space are essential. In this paper, we investigate the interacting preferences of 3D pattern pairs discovered separately in transient and obligate protein complexes. These 3D pattern pairs are not necessarily sequence-consecutive, but each residue in two groups of amino acids from two proteins in a complex is within a certain Å threshold to most residues in the other group. We develop an algorithm called AApairs by which every pair of interacting proteins is represented as a bipartite graph, and it discovers all maximal quasi-bicliques from every bipartite graph to form our 3D pattern pairs. From 112 and 2533 highly conserved 3D pattern pairs discovered in the transient and obligate complexes respectively, we observe that Ala and Leu are the most frequently occurring amino acids in interacting 3D patterns of transient (20.91%) and obligate (33.82%) complexes respectively. From the study on the dipeptide composition on each side of interacting 3D pattern pairs, dipeptides Ala-Ala and Ala-Leu are popular in 3D patterns of both transient and obligate complexes. The interactions between amino acids with large hydrophobicity difference are present more in the transient than in the obligate complexes. On the contrary, in obligate complexes, interactions between hydrophobic residues account for the top 5 most frequently occurring amino acid pairings.
Feng, M., Li, J., Wong, L. & Tan, Y.P. 2008, 'Negative generator border for effective pattern maintenance', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 217-228.
In this paper, we study the maintenance of frequent patterns in the context of the generator representation. The generator representation is a concise and lossless representation of frequent patterns. We effectively maintain the generator representation by systematically expanding its Negative Generator Border. In the literature, very little work has addressed the maintenance of the generator representation. To illustrate the proposed maintenance idea, a new algorithm is developed to maintain the generator representation for support threshold adjustment. Our experimental results show that the proposed algorithm is significantly faster than other state-of-the-art algorithms. This proposed maintenance idea can also be extended to other representations of frequent patterns as demonstrated in this paper. © 2008 Springer-Verlag Berlin Heidelberg.
Li, J. & Hu, X. 2007, 'Workshop BioDM'07 - An overview', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 110-111.
This edited volume contains the papers selected for presentation at the Second Workshop on Data Mining for Biomedical Applications (BioDM'07) held in Nanjing, China on 22nd of May 2007. The workshop was held in conjunction with the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2007), a leading international conference in the areas of data mining and knowledge discovery. The aim of this workshop was to provide a forum for discussing research topics related to biomedical applications where data mining techniques were found to be necessary and/or useful. © Springer-Verlag Berlin Heidelberg 2007.
Vellaisamy, K. & Li, J. 2007, 'Multidimensional decision support indicator (mDSI) for time series stock trend prediction', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 841-848.
This work proposes a generalized approach for predicting trends in time series data with a particular interest in stocks. In this approach, we suggest a multidimensional decision support indicator, mDSI, derived from a sequential data mining process to monitor trends in stocks. Available indicators in the literature often disagree with one another in their predictions because of the specific nature of the features each one uses: moving averages use means, momentums use dispersions, etc. Then again, choosing the best indicator is challenging and expensive. Thus, in this paper, we introduce a compact but robust indicator to learn the trends effectively for any given time series data. That is, it introduces a simple multidimensional indicator, mDSI, which integrates multiple decision criteria into a single index value to eliminate conflicts and improve the overall efficiency. Experiments with mDSI on real data further confirm its efficiency and good performance. © Springer-Verlag Berlin Heidelberg 2007.
Li, J., Liu, G. & Wong, L. 2007, 'Mining statistically important equivalence classes and delta-discriminative emerging patterns', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 430-439.
The support-confidence framework is the most common measure used in itemset mining algorithms, for its antimonotonicity that effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results as observed by many previous studies. In this paper, we introduce a novel algorithm that produces itemsets with ranked statistical merits under sophisticated test statistics such as chi-square, risk ratio, odds ratio, etc. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, itemsets within an equivalence class all share the same level of statistical significance regardless of the variety of test statistics. As an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we just mine closed patterns and generators, taking a simultaneous depth-first search scheme. This parallel approach has not been exploited by any prior work. We evaluate our algorithm on two aspects. In general, we compare to LCM and FPclose which are the best algorithms tailored for mining only closed patterns. In particular, we compare to epMiner which is the most recent algorithm for mining a type of relative risk patterns, known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes even multiple orders of magnitude faster. These statistically ranked patterns and the efficiency have a high potential for real-life applications, especially in biomedical and financial fields where classical test statistics are of dominant interest. © 2007 ACM.
Feng, M., Dong, G., Li, J., Tan, Y.P. & Wong, L. 2007, 'Evolution and maintenance of frequent pattern space when transactions are removed', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 489-497.
This paper addresses the maintenance of discovered frequent patterns when a batch of transactions are removed from the original dataset. We conduct an in-depth investigation on how the frequent pattern space evolves under transaction removal updates using the concept of equivalence classes. Inspired by the evolution analysis, an effective and exact algorithm TRUM is proposed to maintain frequent patterns. Experimental results demonstrate that our algorithm outperforms representative state-of-the-art algorithms. © Springer-Verlag Berlin Heidelberg 2007.
Liu, G., Li, J., Sim, K. & Wong, L. 2007, 'Distance based subspace clustering with flexible dimension partitioning', Proceedings - International Conference on Data Engineering, pp. 1250-1254.
Traditional similarity or distance measurements usually become meaningless when the dimensions of the datasets increase, which has detrimental effects on clustering performance. In this paper, we propose a distance-based subspace clustering model, called nCluster, to find groups of objects that have similar values on subsets of dimensions. Instead of using a grid based approach to partition the data space into non-overlapping rectangle cells as in the density based subspace clustering algorithms, the nCluster model uses a more flexible method to partition the dimensions to preserve meaningful and significant clusters. We develop an efficient algorithm to mine only maximal nClusters. A set of experiments are conducted to show the efficiency of the proposed algorithm and the effectiveness of the new model in preserving significant clusters. © 2007 IEEE.
Ben, N., Yang, Q., Li, J., Chi-Keung, S. & Pal, S. 2007, 'Discovering patterns of DNA methylation: Rule mining with rough sets and decision trees, and comethylation analysis', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 389-397.
DNA methylation regulates the transcription of genes without changing their coding sequences. It plays a vital role in the process of embryogenesis and tumorigenesis. To gain more insight into how this epigenetic mechanism works in human cells, we apply two popular data mining techniques, i.e., Rough Sets and Decision Trees, to uncover the logical rules of DNA methylation. Our results show that the Rough Sets method can generate and utilize fewer rules to fully separate the methylation dataset, whereas the Decision Trees method relies on more rules but involves fewer decision variables to do the same task. We also find that some of the gene promoters are highly comethylated, providing evidence that genes are highly interactive epigenetically in human cells. © Springer-Verlag Berlin Heidelberg 2007.
Liu, G., Li, J., Wong, L. & Hsu, W. 2006, 'Positive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise', PROCEEDINGS OF THE SIXTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, pp. 469-473.
Chen, L., Bhowmick, S.S. & Li, J. 2006, 'Mining temporal indirect associations', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 425-434.
This paper presents a novel pattern called temporal indirect association. An indirect association pattern refers to a pair of items that rarely occur together but highly depend on the presence of a mediator itemset. The existing model of indirect association does not consider the lifespan of items. Consequently, some discovered patterns may be invalid while some useful patterns may not be covered. To overcome this drawback, in this paper, we take into account the lifespan of items to extend the current model to be temporal. An algorithm, MG-Growth, that finds the set of mediators in a pattern-growth manner is developed. Then, we extend the framework of the algorithm to discover temporal indirect associations. Our experimental results demonstrate the efficiency and effectiveness of the proposed algorithms. © Springer-Verlag Berlin Heidelberg 2006.
Sim, K., Li, J., Gopalkrishnan, V. & Liu, G. 2006, 'Mining maximal quasi-bicliques to co-cluster stocks and financial ratios for value investment', Proceedings - IEEE International Conference on Data Mining, ICDM, pp. 1059-1063.
We introduce an unsupervised process to co-cluster groups of stocks and financial ratios, so that investors can gain more insight into how they are correlated. Our idea for the co-clustering is based on a graph concept called maximal quasi-bicliques, which can tolerate the erroneous and/or missing information that is common in stock and financial ratio data. Compared to previous works, our maximal quasi-bicliques require the errors to be evenly distributed, which enables us to capture more meaningful co-clusters. We develop a new algorithm that can efficiently enumerate maximal quasi-bicliques from an undirected graph. The concept of maximal quasi-bicliques is domain-independent; it can be extended to perform co-clustering on any set of data that is modeled by graphs. © 2006 IEEE.
Li, J., Li, H., Wong, L., Pei, J. & Dong, G. 2006, 'Minimum description length principle: Generators are preferable to closed patterns', Proceedings of the National Conference on Artificial Intelligence, pp. 409-414.
The generators and the unique closed pattern of an equivalence class of itemsets share a common set of transactions. The generators are the minimal ones among the equivalent itemsets, while the closed pattern is the maximum one. As a generator is usually smaller than the closed pattern in cardinality, by the Minimum Description Length Principle, the generator is preferable to the closed pattern in inductive inference and classification. To efficiently discover frequent generators from a large dataset, we develop a depth-first algorithm called Gr-growth. The idea is novel in contrast to traditional breadth-first bottom-up generator-mining algorithms. Our extensive performance study shows that Gr-growth is significantly faster (by one or even two orders of magnitude when the support thresholds are low) than the existing generator mining algorithms. It can also be faster than state-of-the-art frequent closed itemset mining algorithms such as FPclose and CLOSET+. Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
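As an illustration of the equivalence-class idea in the abstract above, the following minimal sketch groups itemsets by the set of transactions they occur in, then reads off the closed pattern (the maximum member) and the generators (the minimal members). It is a brute-force enumeration on a hypothetical toy dataset, not the Gr-growth algorithm; the function names and the data are assumptions made for this example.

```python
from itertools import combinations

# Toy transaction database (hypothetical data, for illustration only).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"c"},
]


def cover(itemset, transactions):
    """Return the indices of the transactions that contain every item of `itemset`."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)


def equivalence_classes(transactions, min_support=1):
    """Group frequent itemsets by their cover (brute force, exponential in the item count)."""
    items = sorted(set().union(*transactions))
    classes = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            tids = cover(itemset, transactions)
            if len(tids) >= min_support:
                classes.setdefault(tids, []).append(itemset)
    return classes


def closed_and_generators(members):
    """Return the closed pattern (union of all members) and the minimal members."""
    closed = frozenset().union(*members)
    generators = [m for m in members if not any(other < m for other in members)]
    return closed, generators
```

On this toy data, the itemsets {a}, {b} and {a, b} all occur in transactions 0, 1 and 2, so they form one class whose closed pattern is {a, b} and whose generators are {a} and {b}.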
Liu, G., Sim, K. & Li, J. 2006, 'Efficient mining of large maximal bicliques', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 437-448.
Many real world applications rely on the discovery of maximal biclique subgraphs (complete bipartite subgraphs). However, existing algorithms for enumerating maximal bicliques are not very efficient in practice. In this paper, we propose an efficient algorithm to mine large maximal biclique subgraphs from undirected graphs. Our algorithm uses a divide-and-conquer approach. It effectively uses the size constraints on both vertex sets to prune unpromising bicliques and to reduce the search space iteratively during the mining process. The time complexity of the proposed algorithm is O(nd·N), where n is the number of vertices, d is the maximal degree of the vertices and N is the number of maximal bicliques. Our performance study shows that the proposed algorithm outperforms previous work significantly. © Springer-Verlag Berlin Heidelberg 2006.
Chen, L., Bhowmick, S.S. & Li, J. 2006, 'COWES: Clustering web users based on historical web sessions', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 541-556.
Clustering web users is one of the most important research topics in web usage mining. Existing approaches cluster web users based on the snapshots of web user sessions. They do not take into account the dynamic nature of web usage data. In this paper, we focus on discovering novel knowledge by clustering web users based on the evolutions of their historical web sessions. We present an algorithm called COWES to cluster web users in three steps. First, given a set of web users, we mine the history of their web sessions to extract interesting patterns that capture the characteristics of their usage data evolution. Then, the similarity between web users is computed based on their common interesting patterns. Finally, the desired clusters are generated by a partitioning clustering technique. Web user clusters generated based on their historical web sessions are useful in intelligent web advertisement and web caching. © Springer-Verlag Berlin Heidelberg 2006.
Vellaisamy, K. & Li, J. 2006, 'Bayesian approaches to ranking sequential patterns interestingness', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 241-250.
One of the main issues in rule/pattern mining is measuring the interestingness of a pattern. Interestingness has previously been evaluated in the literature using several approaches, for association as well as for sequential mining. These approaches generally view a sequence as another form of association for computations and understanding. But, by doing so, a sequence might not be fully understood for its statistical significance such as dependence and applicability. This paper proposes a new framework to study sequences' interestingness. It suggests two kinds of Markov processes, namely Bayesian networks, to represent the sequential patterns. The patterns are studied for statistical dependencies in order to rank the sequential patterns by interestingness. This procedure is shown to be useful when the domain knowledge is not easily accessible. © Springer-Verlag Berlin Heidelberg 2006.
Dong, G.Z., Jiang, C.Y., Pei, J., Li, J.Y. & Wong, L. 2005, 'Mining succinct systems of minimal generators of formal concepts', DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PROCEEDINGS, pp. 175-187.
Li, J.Y., Li, H.Q., Soh, D. & Wong, L. 2005, 'A correspondence between maximal complete bipartite subgraphs and closed patterns', KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005, pp. 146-156.
Li, H., Li, J., Wong, L., Feng, M. & Tan, Y.P. 2005, 'Relative risk and odds ratio: A data mining perspective', Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 368-377.
We are often interested in testing whether a given cause has a given effect. If we cannot specify the nature of the factors involved, such tests are called model-free studies. There are two major strategies to demonstrate associations between risk factors (i.e. patterns) and outcome phenotypes (i.e. class labels). The first is that of prospective study designs, and the analysis is based on the concept of "relative risk": What fraction of the exposed (i.e. has the pattern) or unexposed (i.e. lacks the pattern) individuals have the phenotype (i.e. the class label)? The second is that of retrospective designs, and the analysis is based on the concept of "odds ratio": The odds that a case has been exposed to a risk factor is compared to the odds for a case that has not been exposed. The efficient extraction of patterns that have good relative risk and/or odds ratio has not been previously studied in the data mining context. In this paper, we investigate such patterns. We show that this pattern space can be systematically stratified into plateaus of convex spaces based on their support levels. Exploiting convexity, we formulate a number of sound and complete algorithms to extract the most general and the most specific of such patterns at each support level. We compare these algorithms. We further demonstrate that the most efficient among these algorithms is able to mine these sophisticated patterns at a speed comparable to that of mining frequent closed patterns, which are patterns that satisfy considerably simpler conditions. Copyright 2005 ACM.
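To make the two measures in the abstract above concrete, here is a minimal sketch that counts the 2x2 contingency table of a pattern against a class label and computes the textbook relative risk and odds ratio from it. The record layout, function names and toy data are assumptions made for this example; the sketch does not implement the paper's mining algorithms and does not handle zero cells.

```python
def contingency(records, pattern, positive_label):
    """Count the 2x2 table of a pattern versus a class label.

    Returns (a, b, c, d):
      a = exposed and positive,   b = exposed and negative,
      c = unexposed and positive, d = unexposed and negative.
    A record is "exposed" when it contains every item of `pattern`.
    """
    a = b = c = d = 0
    for items, label in records:
        exposed = pattern <= items
        positive = label == positive_label
        if exposed and positive:
            a += 1
        elif exposed:
            b += 1
        elif positive:
            c += 1
        else:
            d += 1
    return a, b, c, d


def relative_risk(a, b, c, d):
    """Risk among the exposed divided by risk among the unexposed (no zero-cell handling)."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-product ratio of the 2x2 table (no zero-cell handling)."""
    return (a * d) / (b * c)


# Hypothetical toy data: (set of items, class label).
RECORDS = (
    [({"x", "y"}, "case")] * 3
    + [({"x"}, "control")]
    + [({"y"}, "case")]
    + [({"y"}, "control")] * 3
)
```

For the pattern {x} on this toy data the table is (3, 1, 1, 3), which gives a relative risk of 3.0 and an odds ratio of 9.0.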
Li, J., Liu, H. & Li, N. 2005, 'Diagnostic rules induced by an ensemble method for childhood leukemia', Proceedings - BIBE 2005: 5th IEEE Symposium on Bioinformatics and Bioengineering, pp. 246-249.
We introduce a new ensemble method based on decision trees to discover significant and diversified rules for subtype classification of childhood acute lymphoblastic leukemia, a heterogeneous disease with individual subtypes differing in their response to chemotherapy. Our approach simply uses each of the top-ranked features as the root node to build up different trees in the ensemble. Since these trees are all generated from original training samples, rules derived by our algorithm are true and reliable. This is a characteristic of our method in contrast to state-of-the-art methods such as Bagging, Boosting and Random Forest, which may produce false rules. Experimental results on a large gene expression profiling data set of childhood leukemia patients demonstrate that our proposed method not only outperforms other classifiers, but can also identify a small subset of genes for biomarker analysis. © 2005 IEEE.
Liu, H., Li, J. & Wong, L. 2004, 'Selection of patient samples and genes for outcome prediction', Proceedings - 2004 IEEE Computational Systems Bioinformatics Conference, CSB 2004, pp. 382-392.
Gene expression profiles with clinical outcome data enable monitoring of disease progression and prediction of patient survival at the molecular level. We present a new computational method for outcome prediction. Our idea is to use an informative subset of original training samples. This subset consists of only short-term survivors who died within a short period and long-term survivors who were still alive after a long follow-up time. These extreme training samples yield a clear platform to identify genes whose expression is related to survival. To find relevant genes, we combine two feature selection methods - entropy measure and Wilcoxon rank sum test - so that a set of sharp discriminating features are identified. The selected training samples and genes are then integrated by a support vector machine to build a prediction model, by which each validation sample is assigned a survival/relapse risk score for drawing Kaplan-Meier survival curves. We apply this method to two data sets: diffuse large-B-cell lymphoma (DLBCL) and primary lung adenocarcinoma. In both cases, patients in high and low risk groups stratified by our risk scores are clearly distinguishable. We also compare our risk scores to some clinical factors, such as International Prognostic Index score for DLBCL analysis and tumor stage information for lung adenocarcinoma. Our results indicate that gene expression profiles combined with carefully chosen learning algorithms can predict patient survival for certain diseases.
Li, J. & Ramamohanarao, K. 2004, 'A tree-based approach to the discovery of diagnostic biomarkers for ovarian cancer', Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science), pp. 682-691.
Computational diagnosis of cancer is a classification problem, and it has two special requirements on a learning algorithm: perfect accuracy and a small number of features used in the classifier. This paper presents our results on an ovarian cancer data set. This data set is described by 15154 features, and consists of 253 samples. Each sample corresponds to a woman who either suffers from ovarian cancer or does not. The raw data is generated by mass spectrometry technology, measuring the intensities of 15154 protein or peptide features in a blood sample from every woman. The purpose is to identify a small subset of the features that can be used as biomarkers to separate the two classes of samples with high accuracy. The identified features can then potentially be used in routine clinical diagnosis to replace labour-intensive and expensive conventional diagnosis methods. Our new tree-based method can achieve the perfect 100% accuracy in 10-fold cross validation on this data set. Meanwhile, this method also directly outputs a small set of biomarkers. Then we explain why support vector machines, naive Bayes, and k-nearest neighbour cannot fulfill the purpose. This study also aims to elucidate the communication between contemporary cancer research and data mining techniques.
Li, J. & Liu, H. 2003, 'Ensembles of cascading trees', Proceedings - IEEE International Conference on Data Mining, ICDM, pp. 585-588.
We introduce a new method, called CS4, to construct committees of decision trees for classification. The method considers different top-ranked features as the root nodes of member trees. This idea is particularly suitable for dealing with high-dimensional bio-medical data as top-ranked features in this type of data usually possess similar merits for classification. To make a decision, the committee combines the power of individual trees in a weighted manner. Unlike Bagging or Boosting, which use bootstrapped training data, our method builds all the member trees of a committee using exactly the same set of training data. We have tested these ideas on UCI data sets as well as recent bio-medical data sets of gene expression or proteomic profiles that are usually described by more than 10,000 features. All the experimental results show that our method is efficient and that its classification performance is superior to C4.5 family algorithms. © 2003 IEEE.
Li, J.Y. & Wong, L. 2002, 'Solving the fragmentation problem of decision trees by discovering boundary emerging patterns', 2002 IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, pp. 653-656.
Li, J. & Wong, L. 2002, 'Geography of differences between two classes of data', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 325-337.
Easily comprehensible ways of capturing main differences between two classes of data are investigated in this paper. In addition to examining individual differences, we also consider their neighbourhood. The new concepts are applied to three gene expression datasets to discover diagnostic gene groups. Based on the idea of prediction by collective likelihoods (PCL), a new method is proposed to classify testing samples. Its performance is competitive to several state-of-the-art algorithms. © 2002 Springer-Verlag Berlin Heidelberg.
Li, J., Ramamohanarao, K. & Dong, G. 2001, 'Combining the strength of pattern frequency and distance for classification', Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science), pp. 455-466.
© Springer-Verlag Berlin Heidelberg 2001. Supervised classification involves many heuristics, including the ideas of decision tree, k-nearest neighbour (k-NN), pattern frequency, neural network, and Bayesian rule, on which induction algorithms are based. In this paper, we propose a new instance-based induction algorithm which combines the strength of pattern frequency and distance. We define a neighbourhood of a test instance. If the neighbourhood contains training data, we use k-NN to make decisions. Otherwise, we examine the support (frequency) of certain types of subsets of the test instance, and calculate support summations for prediction. This scheme is intended to deal with outliers: when no training data is near a test instance, the distance measure is not a proper predictor for classification. We present an effective method to choose an "optimal" neighbourhood factor for a given data set by using guidance from part of the training data. In this work, we find that our algorithm maintains (sometimes exceeds) the outstanding accuracy of k-NN on data sets containing pure continuous attributes, and that our algorithm greatly improves the accuracy of k-NN on data sets containing a mixture of continuous and categorical attributes. In general, our method is much superior to C5.0.
Li, J., Ramamohanarao, K. & Dong, G. 2000, 'Emerging patterns and classification', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 15-32.
© Springer-Verlag Berlin Heidelberg 2000. In this work, we review an important kind of knowledge pattern, emerging patterns (EPs). Emerging patterns are associated with two data sets, and can be used to describe significant changes between the two data sets. To discover all EPs embedded in high-dimension and large-volume databases is a challenging problem due to the number of candidates. We describe a special type of EP, called jumping emerging patterns (JEPs) and review some properties of JEP spaces (the spaces of jumping emerging patterns). We describe efficient border-based algorithms to derive the boundary elements of JEP spaces. Moreover, we describe a new classifier, called DeEPs, which makes use of the discriminating power of emerging patterns. The experimental results show that the accuracy of DeEPs is much better than that of k-nearest neighbor and that of C5.0.
Li, J., Zhang, X., Dong, G., Ramamohanarao, K. & Sun, Q. 1999, 'Efficient mining of high confidence association rules without support thresholds', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 406-411.
© Springer-Verlag Berlin Heidelberg 1999. Association rules describe the degree of dependence between items in transactional datasets by their confidences. In this paper, we first introduce the problem of mining top rules, namely those association rules with 100% confidence. Traditional approaches to this problem need a minimum support (minsup) threshold and then can discover the top rules with supports ≥ minsup; such approaches, however, rely on minsup to help avoid examining too many candidates and they miss those top rules whose supports are below minsup. The low support top rules (e.g. some unusual combinations of some factors that have always caused some disease) may be very interesting. Fundamentally different from previous work, our proposed method uses a dataset partitioning technique and two border-based algorithms to efficiently discover all top rules with a given consequent, without the constraint of support threshold. Importantly, we use borders to concisely represent all top rules, instead of enumerating them individually. We also discuss how to discover all zero-confidence rules and some very high (say 90%) confidence rules using approaches similar to mining top rules. Experimental results using the Mushroom, the Cleveland heart disease, and the Boston housing datasets are reported to evaluate the efficiency of the proposed approach.
Dong, G., Zhang, X., Wong, L. & Li, J. 1999, 'CAEP: Classification by aggregating emerging patterns', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 30-42.
© Springer-Verlag Berlin Heidelberg 1999. Emerging patterns (EPs) are itemsets whose supports change significantly from one dataset to another; they were recently proposed to capture multi-attribute contrasts between data classes, or trends over time. In this paper we propose a new classifier, CAEP, using the following main ideas based on EPs: (i) Each EP can sharply differentiate the class membership of a (possibly small) fraction of instances containing the EP, due to the big difference between its supports in the opposing classes; we define the differentiating power of the EP in terms of the supports and their ratio, on instances containing the EP. (ii) For each instance t, by aggregating the differentiating power of a fixed, automatically selected set of EPs, a score is obtained for each class. The scores for all classes are normalized and the largest score determines t's class. CAEP is suitable for many applications, even those with large volumes of high (e.g. 45) dimensional data; it does not depend on dimension reduction on data; and it is usually equally accurate on all classes even if their populations are unbalanced. Experiments show that CAEP has consistent good predictive accuracy, and it almost always outperforms C4.5 and CBA. By using efficient, border-based algorithms (developed elsewhere) to discover EPs, CAEP scales up on data volume and dimensionality. Observing that accuracy on the whole dataset is too coarse a description of classifiers, we also used a more accurate measure, sensitivity and precision, to better characterize the performance of classifiers. CAEP is also very good under this measure.
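The aggregation idea in the abstract above can be sketched as follows. The abstract says only that the differentiating power is defined "in terms of the supports and their ratio"; the concrete form used here, supp × gr / (gr + 1) with gr the growth rate, is an assumption of this sketch, as are the growth-rate threshold, the function names and the toy data. Candidate patterns are supplied by hand rather than mined, and the score normalization step is omitted.

```python
def support(pattern, instances):
    """Fraction of `instances` (sets of items) that contain every item of `pattern`."""
    return sum(1 for inst in instances if pattern <= inst) / len(instances)


def differentiating_power(pattern, target, opposing, min_growth=2.0):
    """Contribution of `pattern` toward the target class (0.0 if it is not an EP).

    Assumed form: supp_target * gr / (gr + 1), with gr = supp_target / supp_opposing.
    Since gr / (gr + 1) = supp_target / (supp_target + supp_opposing), the code
    uses the latter, which stays defined when the opposing support is 0.
    """
    s_t = support(pattern, target)
    s_o = support(pattern, opposing)
    if s_t <= min_growth * s_o:  # growth rate does not exceed the threshold
        return 0.0
    return s_t * s_t / (s_t + s_o)


def class_scores(instance, patterns, classes, min_growth=2.0):
    """Sum the power of every candidate pattern contained in `instance`, per class."""
    scores = {}
    for label, target in classes.items():
        opposing = [inst for other, insts in classes.items()
                    if other != label for inst in insts]
        scores[label] = sum(
            differentiating_power(p, target, opposing, min_growth)
            for p in patterns if p <= instance
        )
    return scores


# Hypothetical toy data: two classes and three hand-picked candidate patterns.
CLASSES = {
    "P": [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}],
    "N": [{"c"}, {"b", "c"}, {"a", "c"}, {"c"}],
}
PATTERNS = [frozenset({"a"}), frozenset({"c"}), frozenset({"a", "b"})]
```

Calling `class_scores({"a", "b"}, PATTERNS, CLASSES)` gives class P the larger score, so the instance would be assigned to P; the instance `{"a", "c"}` scores higher for N.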
Li, W., Wang, Y., Li, W., Zhang, J. & Li, J. 1998, 'Sparselized higher-order neural network and its pruning algorithm', IEEE International Conference on Neural Networks - Conference Proceedings, pp. 359-362.
In this paper, the fully-connected higher-order neuron and the sparselized higher-order neuron are introduced, the mapping capabilities of fully-connected higher-order neural networks are investigated, and it is proved that an arbitrary Boolean function defined on {0,1}^N can be realized by fully-connected higher-order neural networks. Based on this, in order to simplify the networks' architecture, a pruning algorithm for eliminating the redundant connection weights is also proposed, which can be applied to the implementation of sparselized higher-order neural classifiers and other networks. The simulated results show the effectiveness of the algorithm.
Dong, G. & Li, J. 1998, 'Interestingness of discovered association rules in terms of neighborhood-based unexpectedness', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 72-86.
© Springer-Verlag Berlin Heidelberg 1998. One of the central problems in knowledge discovery is the development of good measures of interestingness of discovered patterns. With such measures, a user needs to manually examine only the more interesting rules, instead of each of a large number of mined rules. Previous proposals of such measures include rule templates, minimal rule cover, actionability, and unexpectedness in the statistical sense or against user beliefs. In this paper we will introduce neighborhood-based interestingness by considering unexpectedness in terms of neighborhood-based parameters. We first present some novel notions of distance between rules and of neighborhoods of rules. The neighborhood-based interestingness of a rule is then defined in terms of the pattern of the fluctuation of confidences or the density of mined rules in some of its neighborhoods. Such interestingness can also be defined for sets of rules (e.g. plateaus and ridges) when their neighborhoods have certain properties. We can rank the interesting rules by combining some neighborhood-based characteristics, the support and confidence of the rules, and users' feedback. We discuss how to implement the proposed ideas and compare our work with related ones. We also give a few expected tendencies of changes due to rule structures, which should be taken into account when considering unexpectedness. We concentrate on association rules and briefly discuss generalization to other types of rules.

Journal articles

Liu, Q., Song, J. & Li, J. 2016, 'Using contrast patterns between true complexes and random subgraphs in PPI networks to predict unknown protein complexes.', Scientific reports, vol. 6, p. 21223.
Most protein complex detection methods utilize unsupervised techniques to cluster densely connected nodes in a protein-protein interaction (PPI) network, in spite of the fact that many true complexes are not dense subgraphs. Supervised methods have been proposed recently, but they do not answer why a group of proteins are predicted as a complex, and they have not investigated how to detect new complexes of one species by training the model on the PPI data of another species. We propose a novel supervised method to address these issues. The key idea is to discover emerging patterns (EPs), a type of contrast pattern, which can clearly distinguish true complexes from random subgraphs in a PPI network. An integrative score of EPs is defined to measure how likely a subgraph of proteins can form a complex. New complexes thus can grow from our seed proteins by iteratively updating this score. The performance of our method is tested on eight benchmark PPI datasets and compared with seven unsupervised methods, two supervised and one semi-supervised methods under five standards to assess the quality of the predicted complexes. The results show that in most cases our method achieved a better performance, sometimes significantly.
Wang, C., Fang, Y. & Li, J.Y. 2016, 'Estimation of NON-WSSUS channel for OFDM system: Exploiting support correlations through a novel adaptive weighted predict-re-estimate L1 minimization approach', Journal of Communications, vol. 11, no. 2, pp. 149-156.
© 2016 Journal of Communications. It is challenging to estimate the wireless channel of the Orthogonal Frequency-Division Multiplexing (OFDM) broadband system under a changing communication environment. The difficulty is mainly attributed to the channel being Non Wide Sense Stationary Uncorrelated Scattering (Non-WSSUS), which implies that the delay and Doppler shift of such a channel are non-stationary and correlated. A Non-WSSUS channel is very different from the classical time-varying channel with constant delay and Doppler shift. In this paper, we propose an estimation method for the Non-WSSUS Channel Impulse Response (CIR) of the OFDM system. Based on the sparsity property of the delay-Doppler spread function, the delay and Doppler shift of the Non-WSSUS channel can be extracted through a Compressive Sensing (CS) approach. Then a novel CS algorithm referred to as Pre-Re L1 is proposed. The proposed CS algorithm exploits the correlations of the sparse supports to obtain adaptive weights for L1 minimization. Numerical simulation results show that the proposed CS method improves the performance of Non-WSSUS wireless channel estimation.
Wang, C., Dong, X., Han, L., Su, X.D., Zhang, Z., Li, J. & Song, J. 2016, 'Identification of WD40 repeats by secondary structure-aided profile-profile alignment.', Journal of theoretical biology.
A WD40 protein typically contains four or more repeats of ~40 residues ended with the Trp-Asp dipeptide, which folds into β-propellers with four strands in each repeat. They often function as scaffolds for protein-protein interactions and are involved in numerous fundamental biological processes. Despite their important functional role, the "velcro" closure of WD40 propellers and the diversity of WD40 repeats make their identification a difficult task. Here we develop a new WD40 Repeat Recognition method (WDRR), which uses predicted secondary structure information to generate candidate repeat segments, and further employs a profile-profile alignment to identify the correct WD40 repeats from candidate segments. In particular, we design a novel alignment scoring function that combines dot product and BLOSUM62, thereby achieving a great balance of sensitivity and accuracy. Taking advantage of these strategies, WDRR could effectively reduce the false positive rate and accurately identify more remote homologous WD40 repeats with precise repeat boundaries. We further use WDRR to re-annotate the Pfam families in the β-propeller clan (CL0186) and identify a number of WD40 repeat proteins with high confidence across nine model organisms. The WDRR web server and the datasets are available at http://protein.cau.edu.cn/wdrr/.
Tee, A.E., Liu, B., Song, R., Li, J., Pasquier, E., Cheung, B.B., Jiang, C., Marshall, G.M., Haber, M., Norris, M.D., Fletcher, J.I., Dinger, M.E. & Liu, T. 2016, 'The long noncoding RNA MALAT1 promotes tumor-driven angiogenesis by up-regulating pro-angiogenic gene expression', Oncotarget, vol. 7, no. 8, pp. 8663-8675.
Neuroblastoma is the most common solid tumor during early childhood. One of the key features of neuroblastoma is extensive tumor-driven angiogenesis due to hypoxia. However, the mechanism through which neuroblastoma cells drive angiogenesis is poorly understood. Here we show that the long noncoding RNA MALAT1 was upregulated in human neuroblastoma cell lines under hypoxic conditions. Conditioned media from neuroblastoma cells transfected with small interfering RNAs (siRNA) targeting MALAT1, compared with conditioned media from neuroblastoma cells transfected with control siRNAs, induced significantly less endothelial cell migration, invasion and vasculature formation. Microarray-based differential gene expression analysis showed that one of the genes most significantly downregulated following MALAT1 suppression in human neuroblastoma cells under hypoxic conditions was fibroblast growth factor 2 (FGF2). RT-PCR and immunoblot analyses confirmed that MALAT1 suppression reduced FGF2 expression, and Enzyme-Linked Immunosorbent Assays revealed that transfection with MALAT1 siRNAs reduced FGF2 protein secretion from neuroblastoma cells. Importantly, addition of recombinant FGF2 protein to the cell culture media reversed the effects of MALAT1 siRNA on vasculature formation. Taken together, our data suggest that up-regulation of MALAT1 expression in human neuroblastoma cells under hypoxic conditions increases FGF2 expression and promotes vasculature formation, and therefore plays an important role in tumor-driven angiogenesis.
Song, R., Liu, Q., Liu, T. & Li, J. 2015, 'Connecting rules from paired miRNA and mRNA expression data sets of HCV patients to detect both inverse and positive regulatory relationships', BMC Genomics, vol. 16, no. Suppl 2.
Intensive research based on the inverse expression relationship has been undertaken to discover the miRNA-mRNA regulatory modules involved in the infection of Hepatitis C virus (HCV), the leading cause of chronic liver diseases. However, biological studies in other fields have found that the inverse expression relationship is not the only regulatory relationship between miRNAs and their targets, and some miRNAs can positively regulate an mRNA by binding at the 5' UTR of the mRNA.
Zhao, Z., Han, G.-S., Yu, Z.-G. & Li, J. 2015, 'Laplacian Normalization and Random Walk on Heterogeneous Networks for Disease-gene Prioritization', Computational Biology and Chemistry, vol. 57, pp. 21-28.
Random walk on heterogeneous networks is a recently emerging approach to effective disease gene prioritization. Laplacian normalization is a technique capable of normalizing the weight of edges in a network. We use this technique to normalize the gene matrix and the phenotype matrix before the construction of the heterogeneous network, and also use this idea to define the transition matrices of the heterogeneous network. Our method has remarkably better performance than the existing methods for recovering known gene–phenotype relationships. The Shannon information entropy of the distribution of the transition probabilities in our networks is found to be smaller than that of the networks constructed by the existing methods, implying that a higher number of top-ranked genes can be verified as disease genes. In fact, the most probable gene–phenotype relationships ranked within the top 3 or top 5 in our gene lists can be confirmed by the OMIM database in many cases. Our algorithms have shown remarkably superior performance over the state-of-the-art algorithms for recovering gene–phenotype relationships. All Matlab code is available upon email request.
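As an illustration only (this is not the authors' Matlab code), the two ingredients named in the abstract can be sketched on a single small network. The sketch assumes the common symmetric form of Laplacian normalization, W[i][j] / sqrt(d_i * d_j), and a standard random walk with restart; the function names, the restart probability of 0.7 and the toy path graph are all hypothetical, and the heterogeneous gene-phenotype coupling of the paper is not reproduced.

```python
import math

def laplacian_normalize(W):
    """Assumed form: divide each edge weight by sqrt(d_i * d_j), d = weighted degree."""
    n = len(W)
    deg = [sum(row) for row in W]
    return [[W[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
             for j in range(n)] for i in range(n)]

def column_stochastic(W):
    """Rescale every column to sum to 1 so it can act as a transition matrix."""
    n = len(W)
    col = [sum(W[i][j] for i in range(n)) for j in range(n)]
    return [[W[i][j] / col[j] if col[j] > 0 else 0.0 for j in range(n)] for i in range(n)]

def random_walk_with_restart(W, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * M p + restart * p0 until the L1 change is below tol."""
    n = len(W)
    M = column_stochastic(laplacian_normalize(W))
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(M[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Hypothetical toy network: a path 0 - 1 - 2 - 3 with unit weights, node 0 as the seed.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
scores = random_walk_with_restart(W, seeds={0})
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

Candidate nodes are then prioritized by their steady-state score, so nodes closer to the seed in the network rank higher.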
Li, Z., He, Y., Wong, L. & Li, J. 2015, 'Burial Level Change Defines a High Energetic Relevance for Protein Binding Interfaces', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 2, pp. 410-421.
Song, R., Catchpoole, D.R., Kennedy, P.J. & Li, J. 2015, 'Identification of lung cancer miRNA-miRNA co-regulation networks through a progressive data refining approach', Journal of Theoretical Biology, vol. 380, pp. 271-279.
Co-regulations of miRNAs have been much less studied than the research on regulations between miRNAs and their target genes, although these two problems are equally important for understanding the entire mechanisms of complex post-transcriptional regulations. The difficulty to construct a miRNA-miRNA co-regulation network lies in how to determine reliable miRNA pairs from various resources of data related to the same disease such as expression levels, gene ontology (GO) databases, and protein-protein interactions. Here we take a novel integrative approach to the discovery of miRNA-miRNA co-regulation networks. This approach can progressively refine the various types of data and the computational analysis results. Applied to three lung cancer miRNA expression data sets of different subtypes, our method has identified a miRNA-miRNA co-regulation network and co-regulating functional modules common to lung cancer. An example of these functional modules consists of genes SMAD2, ACVR1B, ACVR2A and ACVR2B. This module is synergistically regulated by let-7a/b/c/f, is enriched in the same GO category, and has a close proximity in the protein interaction network. We also find that the co-regulation network is scale free and that lung cancer related miRNAs have more synergism in the network. According to our literature survey and database validation, many of these results are biologically meaningful for understanding the mechanism of the complex post-transcriptional regulations in lung cancer.
Liu, Q., Ren, J., Song, J. & Li, J. 2015, 'Co-Occurring Atomic Contacts for the Characterization of Protein Binding Hot Spots', PLoS ONE, vol. 10, no. 12, p. e0144486.
A binding hot spot is a small area at a protein-protein interface that can make significant contribution to binding free energy. This work investigates the substantial contribution made by some special co-occurring atomic contacts at a binding hot spot. A co-occurring atomic contact is a pair of atomic contacts that are close to each other with no more than three covalent-bond steps. We found that two kinds of co-occurring atomic contacts can play an important part in the accurate prediction of binding hot spot residues. One is the co-occurrence of two nearby hydrogen bonds. For example, mutations of any residue in a hydrogen bond network consisting of multiple co-occurring hydrogen bonds could disrupt the interaction considerably. The other kind of co-occurring atomic contact is the co-occurrence of a hydrophobic carbon contact and a contact between a hydrophobic carbon atom and a ring. In fact, this co-occurrence signifies the collective effect of hydrophobic contacts. We also found that the B-factor measurements of several specific groups of amino acids are useful for the prediction of hot spots. Taking the B-factor, individual atomic contacts and the co-occurring contacts as features, we developed a new prediction method and thoroughly assessed its performance via cross-validation and independent dataset test. The results show that our method achieves higher prediction performance than well-known methods such as Robetta, FoldX and Hotpoint. We conclude that these contact descriptors, in particular the novel co-occurring atomic contacts, can be used to facilitate accurate and interpretable characterization of protein binding hot spots.
Xie, C., Zhang, J., Li, R., Li, J., Hong, P., Xia, J. & Chen, P. 2015, 'Automatic classification for field crop insects via multiple-task sparse representation and multiple-kernel learning', Computers and Electronics in Agriculture, vol. 119, pp. 123-132.
Classification of insect species of field crops such as corn, soybeans, wheat, and canola is more difficult than the generic object classification because of high appearance similarity among insect species. To improve the classification accuracy, we develop an insect recognition system using advanced multiple-task sparse representation and multiple-kernel learning (MKL) techniques. As different features of insect images contribute differently to the classification of insect species, the multiple-task sparse representation technique can combine multiple features of insect species to enhance the recognition performance. Instead of using hand-crafted descriptors, our idea of sparse-coding histograms is adopted to represent insect images so that raw features (e.g., color, shape, and texture) can be well quantified. Furthermore, the MKL method is proposed to fuse multiple features effectively. The proposed learning model can be optimized efficiently by jointly optimizing the kernel weights. Experimental results on 24 common pest species of field crops show that our proposed method performs well on the classification of insect species, and outperforms the state-of-the-art methods of the generic insect categorization.
Hasan, M.M., Zhou, Y., Lu, X., Li, J., Song, J. & Zhang, Z. 2015, 'Computational identification of protein pupylation sites by using profile-based composition of k-spaced amino acid pairs', PLoS ONE, vol. 10, no. 6.
Prokaryotic proteins are regulated by pupylation, a type of post-translational modification that contributes to cellular function in bacterial organisms. In the pupylation process, the prokaryotic ubiquitin-like protein (Pup) tagging is functionally analogous to ubiquitination in order to tag target proteins for proteasomal degradation. To date, several experimental methods have been developed to identify pupylated proteins and their pupylation sites, but these experimental methods are generally laborious and costly. Therefore, computational methods that can accurately predict potential pupylation sites based on protein sequence information are highly desirable. In this paper, a novel predictor termed pbPUP has been developed for accurate prediction of pupylation sites. In particular, a sophisticated sequence encoding scheme [i.e. the profile-based composition of k-spaced amino acid pairs (pbCKSAAP)] is used to represent the sequence patterns and evolutionary information of the sequence fragments surrounding pupylation sites. Then, a Support Vector Machine (SVM) classifier is trained using the pbCKSAAP encoding scheme. The final pbPUP predictor achieves an AUC value of 0.849 in 10-fold cross-validation tests and outperforms other existing predictors on a comprehensive independent test dataset. The proposed method is anticipated to be a helpful computational resource for the prediction of pupylation sites. The web server and curated datasets in this study are freely available at http://protein.cau.edu.cn/pbPUP/.
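To make the encoding idea concrete, the following is a minimal sketch of the plain, sequence-only composition of k-spaced amino acid pairs. It is an assumption-laden simplification: the profile-based variant (pbCKSAAP) described in the abstract draws on evolutionary profiles, which are not modeled here, and the function name `cksaap`, the default `k_max=2` and the toy fragment are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order that defines the feature layout.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def cksaap(fragment, k_max=2):
    """Return the frequencies of residue pairs separated by k residues, for k = 0..k_max.

    The result has 400 * (k_max + 1) entries; pairs containing characters
    outside the 20 standard amino acids are ignored.
    """
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_windows = len(fragment) - k - 1  # number of positions where a k-spaced pair fits
        for i in range(max(n_windows, 0)):
            pair = fragment[i] + fragment[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        features.extend(counts[p] / n_windows if n_windows > 0 else 0.0 for p in PAIRS)
    return features

# Hypothetical toy fragment, not taken from the paper's dataset.
vec = cksaap("AKAK", k_max=1)
```

A vector of this kind, computed for the fragment around each candidate lysine, is what a classifier such as an SVM would take as input.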
Ren, J., Liu, Q., Ellis, J. & Li, J. 2015, 'Positive-unlabeled learning for the prediction of conformational B-cell epitopes', BMC Bioinformatics, vol. 16, no. 18.
Background: The incomplete ground truth of training data of B-cell epitopes is a demanding issue in computational epitope prediction. The challenge is that only a small fraction of the surface residues of an antigen are confirmed as antigenic residues (positive training data); the remaining residues are unlabeled. As some of these uncertain residues can possibly be grouped to form novel but currently unknown epitopes, it is misguided to classify all the unlabeled residues as negative training data following the traditional supervised learning scheme. Results: We propose a positive-unlabeled learning algorithm to address this problem. The key idea is to distinguish between epitope-likely residues and reliable negative residues in unlabeled data. The method has two steps: (1) identify reliable negative residues using a weighted SVM with a high recall; and (2) construct a classification model on the positive residues and the reliable negative residues. Complex-based 10-fold cross-validation was conducted to show that this method outperforms the commonly used predictors DiscoTope 2.0, ElliPro and SEPPA 2.0 in every aspect. We conducted four case studies, in which the approach was tested on antigens of West Nile virus, dihydrofolate reductase, beta-lactamase, and two Ebola antigens whose epitopes are currently unknown. All the results were assessed on a newly-established data set of antigen structures not bound by antibodies, instead of on antibody-bound antigen structures. These bound structures may contain unfair binding information such as bound-state B-factors and protrusion index which could exaggerate the epitope prediction performance. Source code is available on request.
Liu, Q., Song, R. & Li, J. 2015, 'Inference of gene interaction networks using conserved subsequential patterns from multiple time course gene expression datasets', BMC Genomics, vol. 16, no. 12.
Motivation: Deciphering gene interaction networks (GINs) from time-course gene expression (TCGx) data is highly valuable to understand gene behaviors (e.g., activation, inhibition, time-lagged causality) at the system level. Existing methods usually use a global or local proximity measure to infer GINs from a single dataset. As the noise contained in a single data set is hardly self-resolved, the results are sometimes not reliable. Also, these proximity measurements cannot handle the co-existence of the various in vivo positive, negative and time-lagged gene interactions. Methods and results: We propose to infer reliable GINs from multiple TCGx datasets using a novel conserved subsequential pattern of gene expression. A subsequential pattern is a maximal subset of genes sharing positive, negative or time-lagged correlations of one expression template on their own subsets of time points. Based on these patterns, a GIN can be built from each of the datasets. It is assumed that reliable gene interactions would be detected repeatedly. We thus use conserved gene pairs from the individual GINs of the multiple TCGx datasets to construct a reliable GIN for a species. We apply our method on six TCGx datasets related to yeast cell cycle, and validate the reliable GINs using protein interaction networks, biopathways and transcription factor-gene regulations. We also compare the reliable GINs with those GINs reconstructed by a global proximity measure Pearson correlation coefficient method from single datasets. It has been demonstrated that our reliable GINs achieve much better prediction performance especially with much higher precision. The functional enrichment analysis also suggests that gene sets in a reliable GIN are more functionally significant. Our method is especially useful to decipher GINs from multiple TCGx datasets related to less studied organisms where little knowledge is available except gene expression data.
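The final step described above, keeping only gene pairs that recur across the per-dataset networks, could look roughly like the sketch below. It covers only that consensus step; mining the subsequential patterns that produce each individual GIN is not shown. The function name `conserved_network`, the `min_support` threshold and the gene pairs are hypothetical choices for illustration, not values from the paper.

```python
from collections import Counter

def conserved_network(networks, min_support=2):
    """Keep undirected gene pairs that appear in at least min_support networks."""
    counts = Counter()
    for edges in networks:
        # Deduplicate within a network so each dataset votes at most once per pair,
        # and drop self-loops.
        unique = {frozenset(e) for e in edges if len(set(e)) == 2}
        counts.update(unique)
    return {pair for pair, c in counts.items() if c >= min_support}

# Hypothetical per-dataset networks; the pairs are illustrative only.
networks = [
    [("CLN1", "CLN2"), ("CLB1", "CLB2")],
    [("CLN2", "CLN1")],
    [("CLB1", "SIC1")],
]
reliable = conserved_network(networks, min_support=2)
```

Storing each pair as a `frozenset` makes the comparison direction-free, so ("CLN1", "CLN2") and ("CLN2", "CLN1") count as the same interaction.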
Liu, Q., Chen, Y.P. & Li, J. 2014, 'k-Partite cliques of protein interactions: A novel subgraph topology for functional coherence analysis on PPI networks', Journal of Theoretical Biology, vol. 340, pp. 146-154.
Ren, J., Liu, Q., Ellis, J. & Li, J. 2014, 'Tertiary structure-based prediction of conformational B-cell epitopes through B factors', International Conference on Intelligent Systems for Molecular Biology (ISMB), pp. 264-273.
Ren, J., Ellis, J.T. & Li, J. 2014, 'Influenza A HA's conserved epitopes and broadly neutralizing antibodies: a prediction method', Journal of Bioinformatics and Computational Biology, vol. 12, no. 5.
Liu, Q., Hoi, S.C., Kwoh, C., Wong, L. & Li, J. 2014, 'Integrating water exclusion theory into beta contacts to predict binding free energy changes and binding hot spots', BMC Bioinformatics, vol. 15, no. 57.
Li, C., Chen, P., Wang, R., Wang, X., Su, Y. & Li, J. 2014, 'PPI-IRO: A Two-Stage Method for Protein-Protein Interaction Extraction Based on Interaction Relation Ontology', International Journal of Data Mining and Bioinformatics, vol. 10, no. 1, pp. 98-119.
Zhao, L., Hoi, S.C., Li, Z., Wong, L., Nguyen, H. & Li, J. 2014, 'Coupling Graphs, Efficient Algorithms and B-cell Epitope Prediction', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, no. 1, pp. 7-16.
Song, R., Liu, Q., Hutvagner, G., Nguyen, H., Ramamohanarao, K., Wong, L. & Li, J. 2014, 'Rule discovery and distance separation to detect reliable miRNA biomarkers for the diagnosis of lung squamous cell carcinoma', BMC Genomics, vol. 15.
Xu, C., Liu, Y.-J., Sun, Q., Li, J. & He, Y. 2014, 'Polyline-sourced Geodesic Voronoi Diagrams on Triangle Meshes', Computer Graphics Forum, vol. 33, no. 7, pp. 161-170.
Liu, Q., Li, Z. & Li, J. 2014, 'Use B-factor related features for accurate classification between protein binding interfaces and crystal packing contacts', BMC Bioinformatics, vol. 15, no. Suppl 16, p. S3.
Zhou, Y., Tang, M., Pan, W., Li, J., Wang, W., Shao, J., Wu, L., Li, J., Yang, Q. & Yan, B. 2014, 'Bird flu outbreak prediction via satellite tracking', IEEE Intelligent Systems, vol. 29, no. 4, pp. 10-17.
Advanced satellite tracking technologies have collected huge amounts of wild bird migration data. Biologists use these data to understand dynamic migration patterns, study correlations between habitats, and predict global spreading trends of avian influenza. The research discussed here transforms the biological problem into a machine learning problem by converting wild bird migratory paths into graphs. H5N1 outbreak prediction is achieved by discovering weighted closed cliques from the graphs using the mining algorithm High-wEight cLosed cliquE miNing (HELEN). The learning algorithm HELEN-p then predicts potential H5N1 outbreaks at habitats. This prediction method is more accurate than traditional methods used on a migration dataset obtained through a real satellite bird-tracking system. Empirical analysis shows that H5N1 spreads in a manner of high-weight closed cliques and frequent cliques.
Li, J.Y. & Wang, L. 2014, 'Therapeutic effect of erythropoietin gene-modified bone marrow mesenchymal stem cell transplantation on rat cerebral infarction', Chinese Journal of Tissue Engineering Research, vol. 18, no. 23, pp. 3664-3669.
Background: Previous studies have shown that erythropoietin can protect neurons and promote nerve regeneration. Objective: To explore the therapeutic effect of erythropoietin gene-modified bone marrow mesenchymal stem cell transplantation via caudal vein on rat cerebral infarction. Methods: Western blot assay was used to identify the expression of exogenous erythropoietin in bone marrow mesenchymal stem cells. A model of middle cerebral artery occlusion was established in Wistar rats using thread method. And then, model rats were randomly divided into model group (PBS injection via the caudal vein), transplantation group (transplantation of bone marrow mesenchymal stem cell suspension), erythropoietin group (transplantation of erythropoietin-transfected bone marrow mesenchymal stem cell suspension). Neurologic function was assessed at 3 days, 1, 2, 3, 4 weeks after cell transplantation. Four weeks after transplantation, the rats were decapitated after anesthesia to take brain tissues for RT-PCR detection of Bcl-2/Bax gene expression. Cell apoptosis was measured by TUNEL. Hematoxylin-eosin staining and fluorescence microscopy were employed to observe the survival and distribution of PKH26-labeled bone marrow mesenchymal stem cells. Results And Conclusion: Western blot results showed that erythropoietin-transfected bone marrow mesenchymal stem cells could express the erythropoietin in vitro. At 1, 2, 3, 4 weeks after transplantation, the neurological defect scores in the transplantation group and erythropoietin group were significantly lower than those in the model group (P < 0.05, P < 0.01). The expression of bcl-2 gene in the infarct region was significantly higher in the erythropoietin group than the transplantation and model groups (P < 0.05), but the expression of bax was significantly decreased (P < 0.05). In the erythropoietin group, the number of apoptotic cells was reduced, and the number of PKH26 positive cells was increased as compared with the other two...
Chen, P., Li, J., Wong, L., Kuwahara, H., Huang, J. & Gao, X. 2013, 'Accurate prediction of hot spot residues through physicochemical characteristics of amino acid sequences', Proteins: Structure, Function, and Bioinformatics, vol. 81, no. 8, pp. 1351-1362.
Hot spot residues of proteins are fundamental interface residues that help proteins perform their functions. Detecting hot spots by experimental methods is costly and time-consuming. Sequential and structural information has been widely used in the compu
Liu, Q., Kwok, C.Y. & Li, J. 2013, 'Binding Affinity Prediction for Protein-Ligand Complexes Based on Contacts and B Factor', Journal of Chemical Information and Modeling, vol. 53, no. 11, pp. 3076-3085.
Accurate determination of protein-ligand binding affinity is a fundamental problem in biochemistry useful for many applications including drug design and protein-ligand docking. A number of scoring functions have been proposed for the prediction of protein-ligand binding affinity. However, accurate prediction is still a challenging problem because poor performance is often seen in the evaluation under the leave-one-cluster-out cross-validation (LCOCV). We introduce a new scoring function named B2BScore to improve the prediction performance. B2BScore integrates two physicochemical properties for protein-ligand binding affinity prediction. One is the property of contacts. A contact between two atoms requires no other atoms to interrupt the atomic contact and assumes that the two atoms should have enough direct contact area. The other is the property of B factor to capture the atomic mobility in the dynamic protein-ligand binding process.
Li, Z., He, Y., Liu, Q., Zhao, L., Wong, L., Kwok, C.Y., Nguyen, H.T. & Li, J. 2013, 'Structural analysis on mutation residues and interfacial water molecules for human TIM disease understanding', BMC Bioinformatics, vol. 14, no. S16, pp. 1-15.
Background: Human triosephosphate isomerase (HsTIM) deficiency is a genetic disease caused often by the pathogenic mutation E104D. This mutation, located at the side of an abnormally large cluster of water in the inter-subunit interface, reduces the thermostability of the enzyme. Why and how these water molecules are directly related to the excessive thermolability of the mutant have not been investigated in structural biology. Results: This work compares the structure of the E104D mutant with its wild type counterparts. It is found that the water topology in the dimer interface of HsTIM is atypical, having a "wet-core-dry-rim" distribution with 16 water molecules tightly packed in a small deep region surrounded by 22 residues including GLU104. These water molecules are co-conserved with their surrounding residues in non-archaeal TIMs (dimers) but not conserved across archaeal TIMs (tetramers), indicating their importance in preserving the overall quaternary structure. As the structural permutation induced by the mutation is not significant, we hypothesize that the excessive thermolability of the E104D mutant is attributed to the easy propagation of atoms' flexibility from the surface into the core via the large cluster of water. It is indeed found that the B factor increment in the wet region is higher than other regions, and, more importantly, the B factor increment in the wet region is maintained in the deeply buried core. Molecular dynamics simulations revealed that for the mutant structure at normal temperature, a clear increase of the root-mean-square deviation is observed for the wet region contacting with the large cluster of interfacial water. Such increase is not observed for other interfacial regions or the whole protein. This clearly suggests that, in the E104D mutant, the large water cluster is responsible for the subunit interface flexibility and overall thermolability, and it ultimately leads to the deficiency of this enzyme.
Li, J.Y. & Zhang, Z.C. 2013, 'Effect of bone marrow mesenchymal stem cells combined with Danhong injection on expression of GAP-43 and Bcl-2 after cerebral infarction', Chinese Journal of Tissue Engineering Research, vol. 17, no. 32, pp. 5871-5876.
Background: Danhong injection, scavenging free radicals and inhibiting lipid peroxidation, can improve microenvironment injury after cerebral infarction. Objective: To explore the influence of bone marrow mesenchymal stem cells combined with Danhong injection on expression of GAP-43 and Bcl-2 after cerebral infarction in rats. Methods: Sixty Wistar rats were selected to prepare models of cerebral infarction by middle cerebral artery occlusion and then randomly divided into control group, bone marrow mesenchymal stem cell group, and combination group. Control group received tail vein injection of PBS. Bone marrow mesenchymal stem cell group received tail vein injection of 2.5×10⁹/L bone marrow mesenchymal stem cell suspension. Combination group received injection of 2.5×10⁹/L bone marrow mesenchymal stem cell suspension + 2 mL/kg Danhong injection, for 5 consecutive days, once a day. Results and Conclusion: There were no significant differences in the neurological dysfunction scores among the three groups at 24 hours and 3 days after implantation (P > 0.05). The neurological dysfunction scores in the combination group were significantly lower than those in the bone marrow mesenchymal stem cell group and control group at 1 and 2 weeks after transplantation (P < 0.05). In the combination group, GAP-43 and Bcl-2 expression was significantly higher than in the bone marrow mesenchymal stem cell group and control group (P < 0.05). Bone marrow mesenchymal stem cell transplantation combined with Danhong injection can significantly promote the local expression of GAP-43 and Bcl-2 after cerebral infarction, and has obvious inhibitory effects on cell apoptosis in rats with cerebral infarction.
Chen, P., Wong, L. & Li, J. 2012, 'Detection of outlier residues for improving interface prediction in protein heterocomplexes', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 4, pp. 1155-1165.
Sequence-based understanding and identification of protein binding interfaces is a challenging research topic due to the complexity in protein systems and the imbalanced distribution between interface and noninterface residues. This paper presents an out
Li, Z., He, Y., Wong, L. & Li, J. 2012, 'Progressive dry-core-wet-rim hydration trend in a nested-ring topology of protein binding interfaces', BMC Bioinformatics, vol. 13, pp. 1-16.
Background: Water is an integral part of protein complexes. It shapes protein binding sites by filling cavities and it bridges local contacts by hydrogen bonds. However, water molecules are usually not included in protein interface models in the past, an
Zhao, L., Wong, L., Lu, L., Hoi, S.C. & Li, J. 2012, 'B-cell epitope prediction through a graph model', BMC Bioinformatics, vol. 13, no. S17, pp. 1-12.
Background: Prediction of B-cell epitopes from antigens is useful to understand the immune basis of antibody-antigen recognition, and is helpful in vaccine design and drug development. Tremendous efforts have been devoted to this long-studied problem; however, existing methods have at least two common limitations. One is that they only favor prediction of those epitopes with protrusive conformations, but show poor performance in dealing with planar epitopes. The other limitation is that they predict all of the antigenic residues of an antigen as belonging to one single epitope even when multiple non-overlapping epitopes of an antigen exist. Results: In this paper, we propose to divide an antigen surface graph into subgraphs by using a Markov Clustering algorithm, and then we construct a classifier to distinguish these subgraphs as epitope or non-epitope subgraphs. This classifier is then taken to predict epitopes for a test antigen. On a large data set comprising 92 antigen-antibody PDB complexes, our method significantly outperforms the state-of-the-art epitope prediction methods, achieving 24.7% higher averaged f-score than the best existing models. In particular, our method can successfully identify those epitopes with a non-planarity which is too small to be addressed by the other models. Our method can also detect multiple epitopes whenever they exist.
Zhao, L., Hoi, S.C., Wong, L., Hamp, T. & Li, J. 2012, 'Structural and Functional Analysis of Multi-Interface Domains', PLoS One, vol. 7, no. 12, pp. 1-13.
A multi-interface domain is a domain that can shape multiple and distinctive binding sites to contact with many other domains, forming a hub in domain-domain interaction networks. The functions played by the multiple interfaces are usually different, but there is no strict bijection between the functions and interfaces as some subsets of the interfaces play the same function. This work applies graph theory and algorithms to discover fingerprints for the multiple interfaces of a domain and to establish associations between the interfaces and functions, based on a huge set of multi-interface proteins from PDB. We found that about 40% of proteins have the multi-interface property; however, the involved multi-interface domains account for only a tiny fraction (1.8%) of the total number of domains. The interfaces of these domains are distinguishable in terms of their fingerprints, indicating the functional specificity of the multiple interfaces in a domain. Furthermore, we observed that both cooperative and distinctive structural patterns, which will be useful for protein engineering, exist in the multiple interfaces of a domain.
Liu, Q., Wong, L. & Li, J. 2012, 'Z-score biological significance of binding hot spots of protein interfaces by using crystal packing as the reference state', BBA - Proteins and Proteomics, vol. 1824, no. 12, pp. 1457-1467.
Characterization of binding hot spots of protein interfaces is a fundamental study in molecular biology. Many computational methods have been proposed to identify binding hot spots. However, there are few studies to assess the biological significance of binding hot spots. We introduce the notion of biological significance of a contact residue for capturing the probability of the residue occurring in or contributing to protein binding interfaces. We take a statistical Z-score approach to the assessment of the biological significance. The method has three main steps. First, the potential score of a residue is defined by using a knowledge-based potential function with relative accessible surface area calculations. A null distribution of this potential score is then generated from artifact crystal packing contacts. Finally, the Z-score significance of a contact residue with a specific potential score is determined according to this null distribution. We hypothesize that residues at binding hot spots have big absolute values of Z-score as they contribute greatly to binding free energy. Thus, we propose to use Z-score to predict whether a contact residue is a hot spot residue. Comparison with previously reported methods on two benchmark datasets shows that this Z-score method is mostly superior to earlier methods. This article is part of a Special Issue entitled: Computational Methods for Protein Interaction and Structural Prediction.
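The last two steps of the method can be sketched as follows, purely as an illustration. The knowledge-based potential function itself is not specified in the abstract, so the sketch takes potential scores as given; the function names, the cutoff of 2.0 and the null scores are hypothetical, and the sample standard deviation is an assumed choice.

```python
from statistics import mean, stdev

def z_score(score, null_scores):
    """Standardize a residue's potential score against the null distribution."""
    mu = mean(null_scores)
    sigma = stdev(null_scores)  # sample standard deviation (an assumed choice)
    if sigma == 0:
        raise ValueError("null distribution has zero variance")
    return (score - mu) / sigma

def is_hot_spot(score, null_scores, threshold=2.0):
    """Flag a contact residue whose |Z| reaches the (hypothetical) threshold."""
    return abs(z_score(score, null_scores)) >= threshold

# Hypothetical null scores standing in for values from crystal packing contacts.
null = [1.0, 2.0, 3.0, 4.0, 5.0]
z_typical = z_score(3.0, null)
flag_extreme = is_hot_spot(10.0, null)
flag_typical = is_hot_spot(3.5, null)
```

Using the absolute value follows the abstract's hypothesis that hot spot residues have large absolute Z-scores, whichever side of the null mean they fall on.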
Li, Z., He, Y., Cao, L., Wong, L. & Li, J. 2012, 'Conservation of water molecules in protein binding interfaces', International Journal of Bioinformatics Research and Applications, vol. 8, no. 3/4, pp. 228-244.
The conservation of interfacial water molecules has only been studied in small data sets consisting of interfaces of a specific function. So far, no general conclusions have been drawn from large-scale analysis, due to the challenges of using structural alignment in large data sets. To avoid using structural alignment, we propose a solvated sequence method to analyse water conservation properties in protein interfaces. We first use water information to label the residues, and then align interfacial residues in a fashion similar to normal sequence alignment. Our results show that, for a water-contacting interfacial residue, substituting it into hydrophobic residues tends to desolvate the local area. Surprisingly, residues with short side chains also tend not to lose their contacting water, emphasising the role of water in shaping binding sites. Deeply buried water molecules are found more conserved in terms of their contacts with interfacial residues.
Li, Y. & Li, J. 2012, 'Disease gene identification by random walk on multigraphs merging heterogeneous genomic and phenotype data', BMC Genomics, vol. 13, no. suppl 7, pp. 1-12.
Background: High-throughput experiments have resulted in many genomic datasets and hundreds of candidate disease genes. To discover the real disease genes from a set of candidate genes, computational methods have been proposed and applied to various types of genomic data sources. As a single source of genomic data is prone to bias, incompleteness and noise, integration of different genomic data sources is needed to accomplish reliable disease gene identification. Results: In contrast to the commonly adopted data integration approach, which integrates separate lists of candidate genes derived from each single data source, we merge various genomic networks into a multigraph which is capable of connecting multiple edges between a pair of nodes. This novel approach provides a data platform with strong noise tolerance to prioritize the disease genes. A new idea of random walk is then developed to work on multigraphs using a modified step to calculate the transition matrix. Our method is further enhanced to deal with heterogeneous data types by allowing cross-walk between phenotype and gene networks. Compared on benchmark datasets, our method is shown to be more accurate than the state-of-the-art methods in disease gene identification. We also conducted a case study to identify disease genes for Insulin-Dependent Diabetes Mellitus. Some of the newly identified disease genes are supported by recently published literature.
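A random walk with restart on a multigraph can be sketched as below. The abstract does not spell out the modified transition-matrix step, so this sketch assumes the simplest choice: the number of parallel edges between two genes acts as the edge weight. The gene names, the restart probability of 0.3, and all function names are hypothetical.

```python
def transition_probabilities(nodes, edges):
    """Transition probabilities from a multigraph edge list.
    edges: list of (u, v) pairs; a repeated pair is a parallel edge
    (for example, the same link reported by two data sources)."""
    weight = {u: {} for u in nodes}
    for u, v in edges:
        weight[u][v] = weight[u].get(v, 0) + 1
        weight[v][u] = weight[v].get(u, 0) + 1
    trans = {}
    for u in nodes:
        total = sum(weight[u].values())
        trans[u] = {v: w / total for v, w in weight[u].items()} if total else {}
    return trans  # trans[u][v] = probability of stepping from u to v

def random_walk_with_restart(nodes, edges, seeds, restart=0.3,
                             tol=1e-10, max_iter=1000):
    trans = transition_probabilities(nodes, edges)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            for v, prob in trans[u].items():
                nxt[v] += (1 - restart) * p[u] * prob
        if sum(abs(nxt[n] - p[n]) for n in nodes) < tol:
            return nxt
        p = nxt
    return p

nodes = ["g1", "g2", "g3"]
# g1-g2 is supported by two sources, g1-g3 by one
edges = [("g1", "g2"), ("g1", "g2"), ("g1", "g3")]
scores = random_walk_with_restart(nodes, edges, seeds={"g1"})
```

In this toy example the candidate connected to the seed gene by two parallel edges ends up ranked above the candidate connected by one, which is the intuition behind merging networks into a multigraph rather than merging ranked lists.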
Li, Z., He, Y., Wong, L. & Li, J. 2012, 'A dry-core-wet-rim hydration pattern in protein binding interfaces', BMC Bioinformatics, vol. 13, no. 51, pp. 1-16.
Background: Water is an integral part of protein complexes. It shapes protein binding sites by filling cavities and it bridges local contacts by hydrogen bonds. However, water molecules have usually not been included in protein interface models in the past, and few distribution profiles of water molecules in protein binding interfaces are known. Results: In this work, we use a tripartite protein-water-protein interface model and a nested-ring atom re-organization method to detect hydration trends and patterns from an interface data set which involves immobilized interfacial water molecules. This data set consists of 206 obligate interfaces, 160 non-obligate interfaces, and 522 crystal packing contacts. The two types of biological interfaces are found to be drier than the crystal packing interfaces in our data, in agreement with a hydration pattern reported earlier, although the previous definition of immobilized water is purely distance-based. The biological interfaces in our data set are also found to be subject to stronger water exclusion in their formation. To study the overall hydration trend in protein binding interfaces, atoms at the same burial level in each tripartite protein-water-protein interface are organized into a ring. The rings of an interface are then ordered with the core atoms placed at the middle of the structure to form a nested-ring topology. We find that water molecules on the rings of an interface are generally configured in a dry-core-wet-rim pattern with a progressive level-wise solvation towards the rim of the interface. This solvation trend becomes even sharper when counterexamples are separated.
Zhao, L., Wong, L., Lu, L., Hoi, S.C. & Li, J. 2012, 'B-cell epitope prediction through a graph model', BMC Bioinformatics, vol. 13, no. suppl 17, p. S20.
Background: Prediction of B-cell epitopes from antigens is useful to understand the immune basis of antibody-antigen recognition, and is helpful in vaccine design and drug development. Tremendous efforts have been devoted to this long-studied problem; however, existing methods have at least two common limitations. One is that they only favor prediction of those epitopes with protrusive conformations, but show poor performance in dealing with planar epitopes. The other limitation is that they predict all of the antigenic residues of an antigen as belonging to one single epitope even when multiple non-overlapping epitopes of an antigen exist. Results: In this paper, we propose to divide an antigen surface graph into subgraphs by using a Markov Clustering algorithm, and then we construct a classifier to distinguish these subgraphs as epitope or non-epitope subgraphs. This classifier is then taken to predict epitopes for a test antigen. On a big data set comprising 92 antigen-antibody PDB complexes, our method significantly outperforms the state-of-the-art epitope prediction methods, achieving 24.7% higher averaged F-score than the best existing models. In particular, our method can successfully identify those epitopes with a non-planarity which is too small to be addressed by the other models. Our method can also detect multiple epitopes whenever they exist.
Sim, K., Liu, G., Gopalkrishnan, V. & Li, J. 2011, 'A Case Study On Financial Ratios Via Cross-graph Quasi-bicliques', Information Sciences, vol. 181, no. 1, pp. 201-216.
Stocks with similar financial ratio values across years have similar price movements. We investigate this hypothesis by clustering groups of stocks that exhibit homogeneous financial ratio values across years, and then study their price movements. We pro
Zeng, T., Li, J. & Liu, J. 2011, 'Distinct Interfacial Biclique Patterns Between ssDNA-binding Proteins And Those With dsDNAs', Proteins-Structure Function And Bioinformatics, vol. 79, no. 2, pp. 598-610.
We introduce a new motif called interfacial biclique pattern to study the difference between double-stranded DNA-binding proteins (DSBs, most of them also known to play the role as transcriptional factors) and single-stranded DNA-binding proteins (SSBs)
Lo, D., Li, J., Wong, L. & Khoo, S. 2011, 'Mining Iterative Generators And Representative Rules For Software Specification Discovery', IEEE Transactions On Knowledge And Data Engineering, vol. 23, no. 2, pp. 282-296.
Billions of dollars are spent annually on software-related cost. It is estimated that up to 45 percent of software cost is due to the difficulty in understanding existing systems when performing maintenance tasks (i.e., adding features, removing bugs, et
Zhao, L., Wong, L. & Li, J. 2011, 'Antibody-specified B-cell Epitope Prediction In Line With The Principle Of Context-awareness', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 6, pp. 1483-1494.
Context-awareness is a characteristic in the recognition between antigens and antibodies, highlighting the reconfiguration of epitope residues when an antigen interacts with a different antibody. A coarse binary classification of antigen regions into epi
Tang, M., Zhou, Y., Li, J., Wang, W., Cui, P., Hou, Y., Luo, Z., Li, J., Lei, F. & Yan, B. 2011, 'Exploring The Wild Birds' Migration Data For The Disease Spread Study Of H5N1: A Clustering And Association Approach', Knowledge And Information Systems, vol. 27, no. 2, pp. 227-251.
Knowledge about the wetland use of migratory bird species during the annual life cycle is very interesting to biologists, as it is critically important in many decision-making processes such as for conservation site construction and avian influenza control. The raw data of the habitat areas and the migration routes are usually large in scale and highly complex when they are determined by high-tech GPS satellite telemetry. In this paper, we convert these biological problems into computational studies and introduce efficient algorithms for the data analysis. Our key idea is the concept of hierarchical clustering for migration habitat localizations, and the notion of association rules for the discovery of migration routes from the scattered location points in the GIS. One of our clustering results is a tree structure, specially called spatial-tree, which is an illustrative map depicting the breeding and wintering home range of bar-headed geese. A related result to this observation is an association pattern that reveals a high possibility that bar-headed geese's potential autumn migration routes lie between the breeding sites in the Qinghai Lake, China and the wintering sites in Tibet river valley. Given the susceptibility of geese to spread H5N1, and on the basis of the chronology and the rates of the bar-headed geese migration movements, we can conjecture that bar-headed geese play an important role in the spread of the H5N1 virus at a regional scale in Qinghai-Tibetan Plateau.
Li, Z., Wong, L. & Li, J. 2011, 'DBAC: A Simple Prediction Method For Protein Binding Hot Spots Based On Burial Levels And Deeply Buried Atomic Contacts', BMC Systems Biology, vol. 5, no. S1, pp. 1-11.
Background: A protein binding hot spot is a cluster of residues in the interface that are energetically important for the binding of the protein with its interaction partner. Identifying protein binding hot spots can give useful information to protein en
Liu, Q., Hoi, S., Su, C., Li, Z., Kwoh, C., Wong, L. & Li, J. 2011, 'Structural Analysis Of The Hot Spots In The Binding Between H1N1 HA And The 2D1 Antibody: Do Mutations Of H1N1 From 1918 To 2009 Affect Much On This Binding?', Bioinformatics, vol. 27, no. 18, pp. 2529-2536.
Motivation: Worldwide and substantial mortality caused by the 2009 H1N1 influenza A has stimulated a new surge of research on H1N1 viruses. An epitope conservation has been learned in the HA1 protein that allows antibodies to cross-neutralize both 1918 a
Feng, M., Dong, G., Li, J., Tan, Y. & Wong, L. 2010, 'Pattern Space Maintenance For Data Updates And Interactive Mining', Computational Intelligence, vol. 26, no. 3, pp. 282-317.
This article addresses the incremental and decremental maintenance of the frequent pattern space. We conduct an in-depth investigation on how the frequent pattern space evolves under both incremental and decremental updates. Based on the evolution analys
Chen, P., Liu, C., Burge, L., Li, J., Mohammad, M., Southerland, W., Gloster, C. & Wang, B. 2010, 'DomSVR: Domain Boundary Prediction With Support Vector Regression From Sequence Information Alone', Amino Acids, vol. 39, no. 3, pp. 713-726.
Protein domains are structural and fundamental functional units of proteins. The information of protein domain boundaries is helpful in understanding the evolution, structures and functions of proteins, and also plays an important role in protein classif
Zhao, L. & Li, J. 2010, 'Mining For The Antibody-antigen Interacting Associations That Predict The B Cell Epitopes', BMC Structural Biology, vol. 10, no. Suppl. 1, pp. 1-13.
Background: Predicting B-cell epitopes is very important for designing vaccines and drugs to fight against the infectious agents. However, due to the high complexity of this problem, previous prediction methods that focus on linear and conformational epi
Chen, P. & Li, J. 2010, 'Sequence-based Identification Of Interface Residues By An Integrative Profile Combining Hydrophobic And Evolutionary Information', BMC Bioinformatics, vol. 11, pp. 0-0.
Background: Protein-protein interactions play essential roles in protein function determination and drug design. Numerous methods have been proposed to recognize their interaction sites, however, only a small proportion of protein complexes have been suc
Chen, P. & Li, J. 2010, 'Prediction of protein long-range contacts using an ensemble of genetic algorithm classifiers with sequence profile centers', BMC Structural Biology, vol. 10, no. Suppl. 1, pp. 1-13.
Background: Prediction of long-range inter-residue contacts is an important topic in bioinformatics research. It is helpful for determining protein structures, understanding protein foldings, and therefore advancing the annotation of protein functions. R
Li, Z. & Li, J. 2010, 'Geometrically centered region: A "wet" model of protein binding hot spots not excluding water molecules', Proteins-Structure Function And Bioinformatics, vol. 78, no. 16, pp. 3304-3316.
A protein interface can be as 'wet' as a protein surface in terms of the number of immobilized water molecules. This important water information has not been explicitly taken by computational methods to model and identify protein binding hot spots, overl
Liu, X., Li, J. & Wang, L. 2010, 'Modeling Protein Interacting Groups By Quasi-bicliques: Complexity, Algorithm, And Application', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 354-364.
Protein-protein interactions (PPIs) are one of the most important mechanisms in cellular processes. To model protein interaction sites, recent studies have suggested to find interacting protein group pairs from large PPI networks at the first step and th
Luo, F., Liu, J. & Li, J. 2010, 'Discovering conditional co-regulated protein complexes by integrating diverse data sources', BMC Systems Biology, vol. 4, no. Suppl. 2, pp. 1-13.
Background: Proteins interacting with each other as a complex play an important role in many molecular processes and functions. Directly detecting protein complexes is still costly, whereas many protein-protein interaction (PPI) maps for model organisms are available owing to the fast development of high-throughput PPI detecting techniques. These binary PPI data provide fundamental and abundant information for inferring new protein complexes. However, PPI data from different experiments usually do not overlap very much. The main reason is that the functions of proteins may be activated only in a certain environment or under a certain stimulus. In short, PPI is condition-specific. Therefore, specifying the conditions under which complexes are present is necessary for a deep understanding of their behaviours. Meanwhile, proteins have various interaction modes and control mechanisms to form different kinds of complexes. Thus the discovery of a certain type of complex should depend on its own distinct biological or topological characteristics. We do not attempt to find all kinds of complexes by using certain features. Here, we integrate transcription regulation data (TR), gene expression data (GE) and protein-protein interaction data at the systems biology level to discover a special kind of protein complex called conditional co-regulated protein complexes. A conditional co-regulated protein complex has three remarkable features: the coding genes of the member proteins share the same transcription factor (TF), under a certain condition the coding genes express co-ordinately, and the member proteins interact mutually as a complex to implement a common biological function.
Liu, Q. & Li, J. 2010, 'Propensity Vectors Of Low-ASA Residue Pairs In The Distinction Of Protein Interactions', Proteins-Structure Function And Bioinformatics, vol. 78, no. 3, pp. 589-602.
We introduce low-ASA residue pairs as classification features for distinguishing the different types of protein interactions. A low-ASA residue pair is defined as two contact residues each from one chain that have a small solvent accessible surface area
Mann, S., Li, J. & Chen, Y. 2010, 'Insights Into Bacterial Genome Composition Through Variable Target GC Content Profiling', Journal of Computational Biology, vol. 17, no. 1, pp. 79-96.
This study presents a new computational method for guanine (G) and cytosine (C), or GC, content profiling based on the idea of multiple resolution sampling (MRS). The benefit of our new approach over existing techniques follows from its ability to locate
Anandagopu, P., Rashid, S. & Li, J. 2010, 'Low Thymine Content in PINK1 mRNAs and Insights into Parkinson's disease', Bioinformation, vol. 4, no. 10, pp. 452-455.
Thymine is the only nucleotide base which is changed to uracil upon transcription, leaving mRNA less hydrophobic compared to its DNA counterpart. All the 16 codons that contain uracil (or thymine in the gene) as the second nucleotide code for the five large hydrophobic residues (LHRs), namely phenylalanine, isoleucine, leucine, methionine and valine. Thymine content (i.e. the fraction of XTX codons, where X = A, C, G, or T) in PINK1 mRNA sequences and its relationship with protein stability and function are the focus of this work. This analysis will shed light on PINK1's stability, and may thus provide a clue to understanding the mitochondrial dysfunction and the failure of oxidative stress control frequently observed in Parkinson's disease. We obtained the complete PINK1 mRNA sequences of 8 different species. The distributions of XTX codons in different frames are calculated. We observed that the thymine content reached the highest level in coding frame 1 of the PINK1 mRNA sequence of Bos taurus (Bt), where it peaked at 27%. Low thymine content in coding frame 1 leads to a reduction in LHRs in the corresponding proteins. Therefore, we conjecture that proteins from the other organisms, including Homo sapiens, lost some of their hydrophobicity and became susceptible to dysfunction. Genes such as PINK1 have reduced thymine in the evolutionary process, thereby making their protein products potentially susceptible to instability and causing disease. Adding more hydrophobic residues (thymine) at appropriate places might help conserve important biological functions.
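The thymine content defined above, the fraction of XTX codons in a reading frame, can be computed with a few lines of code. This is a minimal sketch: the function name and the toy sequence are assumptions, and frames are given here as 0-based offsets (0, 1, 2), which may differ from the frame numbering used in the paper.

```python
def xtx_fraction(seq, frame=0):
    """Fraction of complete codons, read from offset `frame` (0, 1 or 2),
    that have T (or U in mRNA notation) as the second nucleotide."""
    seq = seq.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if not codons:
        return 0.0
    return sum(1 for codon in codons if codon[1] == "T") / len(codons)

# Toy sequence: codons ATG, CTG, GCA at offset 0
print(xtx_fraction("ATGCTGGCA"))
print(xtx_fraction("ATGCTGGCA", frame=1))
```

Trailing nucleotides that do not fill a complete codon are ignored, so sequences whose length is not a multiple of three are handled without error.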
Zeng, T. & Li, J. 2010, 'Maximization of negative correlations in time-course gene expression data for enhancing understanding of molecular pathways.', Nucleic acids research, vol. 38, no. 1, p. e1.
Positive correlation can be diversely instantiated as shifting, scaling or geometric patterns, and it has been extensively explored for time-course gene expression data and pathway analysis. Recently, a trend has emerged in biological studies that focuses on the notion of negative correlations such as opposite expression patterns, complementary patterns and self-negative regulation of transcription factors (TFs). These biological ideas and primitive observations motivate us to formulate and investigate the problem of maximizing negative correlations. The objective is to discover all maximal negative correlations of statistical and biological significance from time-course gene expression data for enhancing our understanding of molecular pathways. Given a gene expression matrix, a maximal negative correlation is defined as an activation-inhibition two-way expression pattern (AIE pattern). We propose a parameter-free algorithm to enumerate the complete set of AIE patterns from a data set. This algorithm can identify significant negative correlations that cannot be identified by the traditional clustering/biclustering methods. To demonstrate the biological usefulness of AIE patterns in the analysis of molecular pathways, we conducted deep case studies for AIE patterns identified from Yeast cell cycle data sets. In particular, in the analysis of the Lysine biosynthesis pathway, new regulation modules and pathway components were inferred according to a significant negative correlation which is likely caused by a co-regulation of the TFs at the higher layer of the biological network. We conjecture that maximal negative correlations between genes are actually a common characteristic in molecular pathways, which can provide insights into the cell stress response study, drug response evaluation, etc.
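As a point of reference for the notion of negative correlation, the sketch below computes the Pearson correlation between two time-course expression profiles. It is not the AIE pattern algorithm, whose definition is not given in the abstract; it only illustrates how opposite expression trends show up as a coefficient near -1. The profile values and names are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical profiles over four time points: one gene rises, one falls
activated = [1.0, 2.0, 3.0, 4.0]
inhibited = [8.0, 6.0, 4.0, 2.0]
r = pearson(activated, inhibited)
```

A pairwise coefficient like this treats two genes at a time; the paper's contribution is to enumerate maximal groups of genes and time points with such opposite behaviour, which a pairwise measure alone does not provide.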
Liu, Q. & Li, J. 2010, 'Protein binding hot spots and the residue-residue pairing preference: a water exclusion perspective.', BMC bioinformatics, vol. 11, p. 244.
BACKGROUND: A protein binding hot spot is a small cluster of residues tightly packed at the center of the interface between two interacting proteins. Though a hot spot constitutes a small fraction of the interface, it is vital to the stability of protein complexes. Recently, a series of hypotheses have been proposed to characterize binding hot spots, including the pioneering O-ring theory, the insightful 'coupling' and 'hot region' principle, and our 'double water exclusion' (DWE) hypothesis. As the perspective changes from the O-ring theory to the DWE hypothesis, we examine the physicochemical properties of the binding hot spots under the new hypothesis and compare them with those under the O-ring theory. RESULTS: The requirements for a cluster of residues to form a hot spot under the DWE hypothesis can be mathematically satisfied by a biclique subgraph if a vertex is used to represent a residue, an edge to indicate a close distance between two residues, and a bipartite graph to represent a pair of interacting proteins. We term these hot spots DWE bicliques. We identified DWE bicliques from crystal packing contacts, obligate and non-obligate interactions. Our comparative study revealed that there are abundant bicliques unique to the biological interactions, indicating specific biological binding behaviors in contrast to crystal packing. The two sub-types of biological interactions also have their own signature bicliques. In our analysis on residue compositions and residue pairing preferences in DWE bicliques, the focus was on interaction-preferred residues (ipRs) and interaction-preferred residue pairs (ipRPs). It is observed that hydrophobic residues are heavily involved in the ipRs and ipRPs of the obligate interactions; and that aromatic residues are favored in the ipRs and ipRPs of the biological interactions, especially in those of the non-obligate interactions. In contrast, the ipRs and ipRPs in crystal packing are dominated by hydrophilic residues, and mo...
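The graph model described above (residues as vertices, close distances as edges, two interacting proteins as the two sides of a bipartite graph) lends itself to a simple biclique test. The sketch below only checks the completeness condition of a biclique; any further requirements of the DWE hypothesis are not modelled, and the residue labels and contact list are hypothetical.

```python
def is_biclique(left, right, contacts):
    """True if every residue in `left` (chain A) is in contact with
    every residue in `right` (chain B)."""
    contact_set = set(contacts)
    return all((a, b) in contact_set for a in left for b in right)

# Hypothetical inter-chain contacts as (chain A residue, chain B residue)
contacts = [
    ("A:L45", "B:Y12"), ("A:L45", "B:W30"),
    ("A:F47", "B:Y12"), ("A:F47", "B:W30"),
    ("A:K50", "B:Y12"),
]
print(is_biclique(["A:L45", "A:F47"], ["B:Y12", "B:W30"], contacts))
print(is_biclique(["A:L45", "A:F47", "A:K50"], ["B:Y12", "B:W30"], contacts))
```

The second call fails because A:K50 has no contact with B:W30, so adding that residue breaks the all-to-all pairing.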
Li, J. & Liu, Q. 2009, ''Double water exclusion': A hypothesis refining the O-ring theory for the hot spots at protein interfaces', Bioinformatics, vol. 25, no. 6, pp. 743-750.
Motivation: The O-ring theory reveals that the binding hot spot at a protein interface is surrounded by a ring of residues that are energetically less important than the residues in the hot spot. As this ring of residues is served to occlude water molecu
Zeng, X., Pei, J., Wang, K. & Li, J. 2009, 'PADS: A simple yet effective pattern-aware dynamic search method for fast maximal frequent pattern mining', Knowledge And Information Systems, vol. 20, no. 3, pp. 375-391.
While frequent pattern mining is fundamental for many data mining tasks, mining maximal frequent patterns efficiently is important in both theory and applications of frequent pattern mining. The fundamental challenge is how to search a large space of ite
Liu, G., Sim, K., Li, J. & Wong, L. 2009, 'Efficient Mining of Distance-Based Subspace Clusters', Statistical Analysis and Data Mining, vol. 2, no. 5-6, pp. 427-444.
Traditional similarity measurements often become meaningless when dimensions of datasets increase. Subspace clustering has been proposed to find clusters embedded in subspaces of high-dimensional datasets. Many existing algorithms use a grid-based approach to partition the data space into nonoverlapping rectangle cells, and then identify connected dense cells as clusters. The rigid boundaries of the grid-based approach may cause a real cluster to be divided into several small clusters. In this paper, we propose to use a sliding-window approach to partition the dimensions to preserve significant clusters. We call this model the nCluster model. The sliding-window approach generates more bins than the grid-based approach, and thus incurs a higher mining cost. We develop a deterministic algorithm, called MaxnCluster, to mine nClusters efficiently. MaxnCluster uses several techniques to speed up the mining, and it produces only maximal nClusters to reduce result size. Non-maximal nClusters are pruned without the need to store the discovered nClusters in memory, which is key to the efficiency of MaxnCluster. Our experimental results show that (i) the nCluster model can indeed preserve clusters that are shattered by the grid-based approach on synthetic datasets; (ii) the nCluster model produces more significant clusters than the grid-based approach on two real gene expression datasets and (iii) MaxnCluster is efficient in mining maximal nClusters.
Sim, K., Li, J., Gopalkrishnan, V. & Liu, G. 2009, 'Mining maximal quasi-bicliques: Novel algorithm and applications in the stock market and protein networks', Statistical Analysis and Data Mining, vol. 2, no. 4, pp. 255-273.
Several real-world applications require mining of bicliques, as they represent correlated pairs of data clusters. However, the mining quality is adversely affected by missing and noisy data. Moreover, some applications only require strong interactions between data members of the pairs, but bicliques are pairs that display complete interactions. We address these two limitations by proposing maximal quasi-bicliques. Maximal quasi-bicliques tolerate erroneous and missing data, and also relax the interactions between the data members of their pairs. Besides, maximal quasi-bicliques do not suffer from the skewed distribution of missing edges that prior quasi-bicliques have. We develop an algorithm, MQBminer, which mines the complete set of maximal quasi-bicliques from either bipartite or non-bipartite graphs. We demonstrate the versatility and effectiveness of maximal quasi-bicliques in discovering highly correlated pairs of data in two diverse real-world datasets. First, we propose to solve a novel financial stocks analysis problem using maximal quasi-bicliques to co-cluster stocks and financial ratios. Results show that the stocks in our co-clusters usually have significant correlations in their price performance. Second, we use maximal quasi-bicliques on a protein network mining problem and we show that pairs of protein groups mined by maximal quasi-bicliques are more significant than those mined by maximal bicliques.
Kim, S., Kang, J., Chung, Y., Li, J. & Ryu, K. 2008, 'Clustering Orthologous Proteins Across Phylogenetically Distant Species', Proteins-Structure Function And Bioinformatics, vol. 71, no. 3, pp. 1113-1122.
The quality of orthologous protein clusters (OPCs) is largely dependent on the results of the reciprocal BLAST (basic local alignment search tool) hits among genomes. The BLAST algorithm is very efficient and fast, but it is very difficult to get optimal
Liu, G., Li, J. & Wong, L. 2008, 'A New Concise Representation Of Frequent Itemsets Using Generators And A Positive Border', Knowledge And Information Systems, vol. 17, no. 1, pp. 35-56.
A complete set of frequent itemsets can get undesirably large due to redundancy when the minimum support threshold is low or when the database is dense. Several concise representations have been previously proposed to eliminate the redundancy. Generator
Liu, G., Li, J. & Wong, L. 2008, 'Assessing and predicting protein interactions using both local and global network topological metrics.', Genome informatics. International Conference on Genome Informatics, vol. 21, pp. 138-149.
High-throughput protein interaction data, with ever-increasing volume, are becoming the foundation of many biological discoveries. However, high-throughput protein interaction data are often associated with high false positive and false negative rates. It is desirable to develop scalable methods to identify these errors. In this paper, we develop a computational method to identify spurious interactions and missing interactions from high-throughput protein interaction data. Our method uses both local and global topological information of protein pairs, and it assigns a local interacting score and a global interacting score to every protein pair. The local interacting score is calculated based on the common neighbors of the protein pairs. The global interacting score is computed using globally interacting protein group pairs. The two scores are then combined to obtain a final score called LGTweight to indicate the interacting possibility of two proteins. We tested our method on the DIP yeast interaction dataset. The experimental results show that the interactions ranked top by our method have higher functional homogeneity and localization coherence than existing methods, and our method also achieves higher sensitivity and precision under 5-fold cross validation than existing methods.
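A local score based on common neighbors can be sketched as below. The abstract does not give the formula behind the local interacting score or LGTweight, so this sketch substitutes the Jaccard index of the two proteins' neighbor sets as one plausible common-neighbor measure. The protein names, edge list, and function names are hypothetical.

```python
def build_adjacency(edges):
    """Map each protein to the set of proteins it interacts with."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def common_neighbor_score(adj, a, b):
    """Jaccard index of the neighbor sets of a and b, excluding a and b
    themselves (an assumed stand-in for the paper's local score)."""
    na = adj.get(a, set()) - {b}
    nb = adj.get(b, set()) - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
         ("P1", "P4"), ("P2", "P4"), ("P5", "P1")]
adj = build_adjacency(edges)
print(common_neighbor_score(adj, "P1", "P2"))
print(common_neighbor_score(adj, "P3", "P5"))
```

Because the score is defined for any protein pair, it can be used both to question a reported interaction with a low score and to suggest a missing interaction between an unlinked pair with a high score.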
Li, J. & Yang, Q. 2007, 'Strong Compound-risk Factors: Efficient Discovery Through Emerging Patterns And Contrast Sets', IEEE Transactions on Information Technology in Biomedicine, vol. 11, no. 5, pp. 544-552.
Odds ratio (OR), relative risk (RR) (risk ratio), and absolute risk reduction (ARR) (risk difference) are biostatistics measurements that are widely used for identifying significant risk factors in dichotomous groups of subjects. In the past, they have o
Pang, B., Kuralmani, V., Joshi, R., Yin, H., Lee, K., Ang, B., Li, J., Leong, T. & Ng, I. 2007, 'Hybrid Outcome Prediction Model For Severe Traumatic Brain Injury', Journal Of Neurotrauma, vol. 24, no. 1, pp. 136-146.
Numerous studies addressing different methods of head injury prognostication have been published. Unfortunately, these studies often incorporate different head injury prognostication models and study populations, thus making direct comparison difficult,
Li, J., Liu, G., Li, H. & Wong, L. 2007, 'Maximal Biclique Subgraphs And Closed Pattern Pairs Of The Adjacency Matrix: A One-to-one Correspondence And Mining Algorithms', IEEE Transactions On Knowledge And Data Engineering, vol. 19, no. 12, pp. 1625-1636.
Maximal biclique (also known as complete bipartite) subgraphs can model many applications in Web mining, business, and bioinformatics. Enumerating maximal biclique subgraphs from a graph is a computationally challenging problem, as the size of the output
Mann, S., Li, J. & Chen, Y. 2007, 'A pHMM-ANN based discriminative approach to promoter identification in prokaryote genomic contexts', Nucleic Acids Research, vol. 35, no. 2, pp. 1-7.
The computational approach for identifying promoters on increasingly large genomic sequences has led to many false positives. The biological significance of promoter identification lies in the ability to locate true promoters with and without prior seque
Aung, Z. & Li, J. 2007, 'Mining super-secondary structure motifs from 3d protein structures: a sequence order independent approach.', Genome informatics. International Conference on Genome Informatics, vol. 19, pp. 15-26.
Super-secondary structure elements (super-SSEs) are the structurally conserved ensembles of secondary structure elements (SSEs) within a protein. They are of great biological interest. In this work, we present a method to formally represent and mine the sequence order independent super-SSE motifs that occur repeatedly in large data sets of protein structures. We represent a protein structure as a graph, and mine the common cliques from a set of protein graphs in order to find the motifs. We mine two categories of super-SSE motifs: the generic motifs that occur frequently across the entire database of protein structures, and the fold-preferential motifs that are concentrated in particular protein fold types. From the experimental data set of 600 proteins belonging to 15 large SCOP Folds, we have discovered 21 generic motifs and 75 fold-preferential motifs that are both statistically significant and biologically relevant. A number of the discovered motifs (both generic and fold-preferential) resemble the well-known super-SSE motifs in the literature such as beta hairpins, Greek keys, zinc fingers, etc. Some of the discovered motifs are of novel shapes that have not been documented yet. Our method is time-efficient: it can discover all the motifs across the 600 proteins in less than 14 minutes on a standalone PC. The discovered motifs are reported in our project webpage: http://www1.i2r.a-star.edu.sg/~azeyar/SuperSSE/
Li, H., Li, J. & Wong, L. 2006, 'Discovering Motif Pairs At Interaction Sites From Protein Sequences On A Proteome-wide Scale', Bioinformatics, vol. 22, no. 8, pp. 989-996.
Motivation: Protein-protein interaction, mediated by protein interaction sites, is intrinsic to many functional processes in the cell. In this paper, we propose a novel method to discover patterns in protein interaction sites. We observed from protein in
Mann, S., Li, J. & Chen, Y.-P.P. 2006, 'A pHMM-ANN based discriminative approach to promoter identification in prokaryote genomic contexts', Nucleic Acids Research, vol. 35, no. 2, pp. e12-e12.
Huang, D., Chow, T., Ma, E. & Li, J. 2005, 'Efficient Selection Of Discriminative Genes From Microarray Gene Expression Data For Cancer Diagnosis', IEEE Transactions On Circuits and Systems - I: Fundamental Theory and Applications, vol. 52, no. 9, pp. 1909-1918.
A new mutual information (MI)-based feature-selection method to solve the so-called large p and small n problem experienced in microarray gene expression-based data is presented. First, a grid-based feature clustering algorithm is introduced to elimina
Liu, H., Li, J. & Wong, L. 2005, 'Use Of Extreme Patient Samples For Outcome Prediction From Gene Expression Data', Bioinformatics, vol. 21, no. 16, pp. 3377-3384.
Motivation: Patient outcome prediction using microarray technologies is an important application in bioinformatics. Based on patients' genotypic microarray data, predictions are made to estimate patients' survival time and their risk of tumor metastasis
Liu, H., Han, H., Li, J. & Wong, L. 2005, 'DNAFSMiner: A Web-based Software Toolbox To Recognize Two Types Of Functional Sites In DNA Sequences', Bioinformatics, vol. 21, no. 5, pp. 671-673.
DNAFSMiner (DNA Functional Sites Miner) is a web-based software toolbox to recognize functional sites in nucleic acid sequences. Currently in this toolbox, we provide two software tools: TIS Miner and Poly(A) Signal Miner. The TIS Miner can be used to predict
Li, J. & Li, H. 2005, 'Using Fixed Point Theorems To Model The Binding In Protein-protein Interactions', IEEE Transactions On Knowledge And Data Engineering, vol. 17, no. 8, pp. 1079-1087.
The binding in protein-protein interactions exhibits a kind of biochemical stability in cells. The mathematical notion of fixed points also describes stability. A point is a fixed point if it remains unchanged after a transformation by a function. Many p
Dong, G. & Li, J. 2005, 'Mining Border Descriptions Of Emerging Patterns From Dataset Pairs', Knowledge And Information Systems, vol. 8, no. 2, pp. 178-202.
The mining of changes or differences or other comparative patterns from a pair of datasets is an interesting problem. This paper is focused on the mining of one type of comparative pattern called emerging patterns. Emerging patterns are denoted by EPs an
Li, J. & Wong, L. 2005, 'Structural Geography Of The Space Of Emerging Patterns', Intelligent Data Analysis, vol. 9, no. 6, pp. 567-588.
Describing and capturing significant differences between two classes of data is an important data mining and classification research topic. In this paper, we use emerging patterns to describe these significant differences. Such a pattern occurs in one cl
Li, H. & Li, J. 2005, 'Discovery Of Stable And Significant Binding Motif Pairs From PDB Complexes And Protein Interaction Datasets', Bioinformatics, vol. 21, no. 3, pp. 314-324.
Motivation: Discovery of binding sites is important in the study of protein-protein interactions. In this paper, we introduce stable and significant motif pairs to model protein-binding sites. The stability is the pattern's resistance to some transformat
Li, J., Wong, L. & Yang, Q. 2005, 'Data mining in bioinformatics', IEEE Intelligent Systems, vol. 20, no. 6, pp. 16-18.
Li, J., Manoukian, T., Dong, G. & Ramamohanarao, K. 2004, 'Incremental Maintenance On The Border Of The Space Of Emerging Patterns', Data Mining And Knowledge Discovery, vol. 9, no. 1, pp. 89-116.
Emerging patterns (EPs) are useful knowledge patterns with many applications. In recent studies on bio-medical profiling data, we have successfully used such patterns to solve difficult cancer diagnosis problems and produced higher classification accurac
Li, J., Dong, G., Ramamohanarao, K. & Wong, L. 2004, 'Deeps: A New Instance-based Lazy Discovery And Classification System', Machine Learning, vol. 54, no. 2, pp. 99-124.
Distance is widely used in most lazy classification systems. Rather than using distance, we make use of the frequency of an instance's subsets of features and the frequency-change rate of the subsets among training classes to perform both knowledge disco
Liu, H., Han, H., Li, J. & Wong, L. 2004, 'Using Amino Acid Patterns to Accurately Predict Translation Initiation Sites', In silico Biology, vol. 4, no. 22, pp. 1-11.
The translation initiation site (TIS) prediction problem is about how to correctly identify TIS in mRNA, cDNA, or other types of genomic sequences. High prediction accuracy can be helpful in a better understanding of protein coding from nucleotide sequences. This is an important step in genomic analysis to determine protein coding from nucleotide sequences. In this paper, we present an in silico method to predict translation initiation sites in vertebrate cDNA or mRNA sequences. This method consists of three sequential steps as follows. In the first step, candidate features are generated using k-gram amino acid patterns. In the second step, a small number of top-ranked features are selected by an entropy-based algorithm. In the third step, a classification model is built to recognize true TISs by applying support vector machines or ensembles of decision trees to the selected features. We have tested our method on several independent data sets, including two public ones and our own extracted sequences. The experimental results achieved are better than those reported previously using the same data sets. Our high accuracy not only demonstrates the feasibility of our method, but also indicates that there might be 'amino acid' patterns around TIS in cDNA and mRNA sequences.
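The first two steps described in this abstract (k-gram feature generation, then entropy-based ranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation: information gain on the presence or absence of each k-gram is used here as a simple stand-in for their entropy-based selection algorithm, and the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter


def kgram_features(protein, k=2):
    """Count every overlapping k-gram (length-k substring) in an amino acid string."""
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(present, labels):
    """Entropy reduction from splitting the samples on presence/absence of one k-gram."""
    with_gram = [y for p, y in zip(present, labels) if p]
    without_gram = [y for p, y in zip(present, labels) if not p]
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_gram, without_gram)
        if part
    )
    return entropy(labels) - remainder


def rank_kgrams(sequences, labels, k=2):
    """Return (gain, k-gram) pairs, highest gain first; ties broken alphabetically."""
    counts = [kgram_features(s, k) for s in sequences]
    vocabulary = sorted(set().union(*counts))
    scored = [(information_gain([g in c for c in counts], labels), g) for g in vocabulary]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical toy data: label 1 = true site, label 0 = false candidate.
toy_sequences = ["MAAK", "MAAR", "GGKK", "GGRK"]
toy_labels = [1, 1, 0, 0]
top_ranked = rank_kgrams(toy_sequences, toy_labels, k=2)[:3]
```

In the third step of the abstract, the top-ranked features would be passed to a classifier such as a support vector machine; that step is omitted here to keep the sketch dependency-free.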
Li, J. & Ong, H. 2004, 'Feature Space Transformation For Better Understanding Biological And Medical Classifications', Journal Of Research And Practice In Information Technology, vol. 36, no. 3, pp. 131-144.
Recently published gene expression profiles and proteomic mass/charge ratios are extremely high-dimensional data. Though support vector machines can well learn the inner relationship of the data for classification, the non-linear kernel functions pose an
Meng, S., Zhang, Z. & Li, J. 2004, 'Twelve C2H2 Zinc-finger Genes On Human Chromosome 19 Can Be Each Translated Into The Same Type Of Protein After Frameshifts', Bioinformatics, vol. 20, no. 1, pp. 1-4.
We report a discovery that, of the 226 C2H2 zinc-finger (C2H2-ZNF) genes on human chromosome 19, 12 genes each have two open reading frames (ORFs) that are in different reading frames but that can be translated into the same type of C2H2-ZNF proteins. We
Liu, H., Li, J. & Wong, L. 2004, 'Selection of patient samples and genes for outcome prediction', Proceedings of the IEEE Computational Systems Bioinformatics Conference (CSB), pp. 382-392.
Gene expression profiles with clinical outcome data enable monitoring of disease progression and prediction of patient survival at the molecular level. We present a new computational method for outcome prediction. Our idea is to use an informative subset of original training samples. This subset consists of only short-term survivors who died within a short period and long-term survivors who were still alive after a long follow-up time. These extreme training samples yield a clear platform to identify genes whose expression is related to survival. To find relevant genes, we combine two feature selection methods -- entropy measure and Wilcoxon rank sum test -- so that a set of sharp discriminating features are identified. The selected training samples and genes are then integrated by a support vector machine to build a prediction model, by which each validation sample is assigned a survival/relapse risk score for drawing Kaplan-Meier survival curves. We apply this method to two data sets: diffuse large-B-cell lymphoma (DLBCL) and primary lung adenocarcinoma. In both cases, patients in high and low risk groups stratified by our risk scores are clearly distinguishable. We also compare our risk scores to some clinical factors, such as International Prognostic Index score for DLBCL analysis and tumor stage information for lung adenocarcinoma. Our results indicate that gene expression profiles combined with carefully chosen learning algorithms can predict patient survival for certain diseases.
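The sample-selection idea in this abstract can be sketched in a few lines of Python. This is an illustrative sketch only, under stated assumptions: the `Patient` record, the cutoff parameters, and the rule that any patient followed for at least `long_cutoff` years counts as a long-term survivor are all hypothetical choices, not details taken from the paper.

```python
from typing import NamedTuple


class Patient(NamedTuple):
    patient_id: str
    followup_years: float  # time until death or until last follow-up
    died: bool             # True if death was observed


def select_extreme_samples(patients, short_cutoff, long_cutoff):
    """Keep only the two extremes of the cohort as training samples.

    Short-term survivors: death observed within short_cutoff years.
    Long-term survivors: followed for at least long_cutoff years
    (assumption: they were alive at the cutoff, whatever happened later).
    Patients in between are left out of training.
    """
    short_term = [p for p in patients if p.died and p.followup_years <= short_cutoff]
    long_term = [p for p in patients if p.followup_years >= long_cutoff]
    return short_term, long_term


# Hypothetical cohort with cutoffs of 1 and 8 years.
cohort = [
    Patient("a", 0.5, True),
    Patient("b", 3.0, True),
    Patient("c", 9.0, False),
    Patient("d", 0.8, False),
    Patient("e", 10.0, True),
]
short_term, long_term = select_extreme_samples(cohort, short_cutoff=1.0, long_cutoff=8.0)
```

Patient "d" is excluded because a short follow-up without an observed death says nothing about survival, which is why the short-term group requires `died` to be true.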
Liu, H., Han, H., Li, J. & Wong, L. 2004, 'Using amino acid patterns to accurately predict translation initiation sites', In silico Biology, vol. 4, no. 3, pp. 255-269.
Li, H., Li, J., Tan, S.H. & Ng, S.K. 2004, 'Discovery of binding motif pairs from protein complex structural data and protein interaction sequence data', Pacific Symposium on Biocomputing, pp. 312-323.
Unravelling the underlying mechanisms of protein interactions requires knowledge about the interactions' binding sites. In this paper, we use a novel concept, binding motif pairs, to describe binding sites. A binding motif pair consists of two motifs each derived from one side of the binding protein sequences. The discovery is a directed approach that uses a combination of two data sources: 3-D structures of protein complexes and sequences of interacting proteins. We first extract maximal contact segment pairs from the protein complexes' structural data. We then use these segment pairs as templates to sub-group the interacting protein sequence dataset, and conduct an iterative refinement to derive significant binding motif pairs. This combination approach is efficient in handling large datasets of protein interactions. From a dataset of 78,390 protein interactions, we have discovered 896 significant binding motif pairs. The discovered motif pairs include many novel motif pairs as well as motifs that agree well with experimentally validated patterns in the literature.
Li, J., Liu, H., Ng, S. & Wong, L. 2003, 'Discovery Of Significant Rules For Classifying Cancer Diagnosis Data', Bioinformatics, vol. 19, pp. II93-II102.
Methods and Results: We introduce a new method to discover many diversified and significant rules from high dimensional profiling data. We also propose to aggregate the discriminating power of these rules for reliable predictions. The discovered rules ar
Li, J., Liu, H., Downing, J., Yeoh, A. & Wong, L. 2003, 'Simple Rules Underlying Gene Expression Profiles Of More Than Six Subtypes Of Acute Lymphoblastic Leukemia (ALL) Patients', Bioinformatics, vol. 19, no. 1, pp. 71-78.
Motivations and Results: For classifying gene expression profiles or other types of medical data, simple rules are preferable to non-linear distance or kernel functions. This is because rules may help us understand more about the application in addition
Liu, H., Han, H., Li, J. & Wong, L. 2003, 'An in-silico method for prediction of polyadenylation signals in human sequences', Genome informatics. International Conference on Genome Informatics, vol. 14, pp. 84-93.
This paper presents a machine learning method to predict polyadenylation signals (PASes) in human DNA and mRNA sequences by analysing features around them. This method consists of three sequential steps of feature manipulation: generation, selection and integration of features. In the first step, new features are generated using k-gram nucleotide acid or amino acid patterns. In the second step, a number of important features are selected by an entropy-based algorithm. In the third step, support vector machines are employed to recognize true PASes from a large number of candidates. Our study shows that true PASes in DNA and mRNA sequences can be characterized by different features, and also shows that both upstream and downstream sequence elements are important for recognizing PASes from DNA sequences. We tested our method on several public data sets as well as our own extracted data sets. In most cases, we achieved better validation results than those reported previously on the same data sets. The important motifs observed are highly consistent with those reported in literature.
Li, J., Liu, H., Ng, S.-K. & Wong, L. 2003, 'Discovery of significant rules for classifying cancer diagnosis data', Bioinformatics, vol. 19, pp. II93-II102.
Li, J., Ng, S.K. & Wong, L. 2003, 'Bioinformatics adventures in database research', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 2572, pp. 31-46.
Informatics has helped launch molecular biology into the genomic era. It appears certain that informatics will remain a major contributor to molecular biology in the post-genome era. We discuss here data integration and datamining in bioinformatics, as well as the role that database theory played in these topics. We also describe LIMS as a third key topic in bioinformatics where advances in database system and theory can be very relevant. © Springer-Verlag Berlin Heidelberg 2003.
Li, J. & Wong, L. 2002, 'Identifying Good Diagnostic Gene Groups From Gene Expression Profiles Using The Concept Of Emerging Patterns', Bioinformatics, vol. 18, no. 5, pp. 725-734.
Motivations and Results: Gene groups that are significantly related to a disease can be detected by conducting a series of gene expression experiments. This work is aimed at discovering special types of gene groups that satisfy the following property. In
Yeoh, E., Ross, M., Shurtleff, S., Williams, W., Patel, D., Mahfouz, R., Behm, F., Raimondi, S., Relling, M., Patel, A., Cheng, C., Campana, D., Wilkins, D., Zhou, X., Li, J., Liu, H., Pui, C., Evans, W., Naeve, C., Wong, L. & Downing, J. 2002, 'Classification, Subtype Discovery, And Prediction Of Outcome In Pediatric Acute Lymphoblastic Leukemia By Gene Expression Profiling', Cancer Cell, vol. 1, no. 2, pp. 133-143.
Treatment of pediatric acute lymphoblastic leukemia (ALL) is based on the concept of tailoring the intensity of therapy to a patient's risk of relapse. To determine whether gene expression profiling could enhance risk assignment, we used oligonucleotide
Liu, H., Li, J. & Wong, L. 2002, 'A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns', Genome informatics. International Conference on Genome Informatics, vol. 13, pp. 51-60.
Feature selection plays an important role in classification. We present a comparative study on six feature selection heuristics by applying them to two sets of data. The first set of data are gene expression profiles from Acute Lymphoblastic Leukemia (ALL) patients. The second set of data are proteomic patterns from ovarian cancer patients. Based on features chosen by these methods, error rates of several classification algorithms were obtained for analysis. Our results demonstrate the importance of feature selection in accurately classifying new samples.
Li, J., Dong, G. & Ramamohanarao, K. 2000, 'Making Use Of The Most Expressive Jumping Emerging Patterns For Classification', Knowledge Discovery And Data Mining, Proceedings: Current Issues And New Applications, vol. 1805, pp. 220-232.
Classification aims to discover a model from training data that can be used to predict the class of test instances. In this paper, we propose the use of jumping emerging patterns (JEPs) as the basis for a new classifier called the JEP-Classifier. Each J
Li, J. & Chow, T. 1997, 'Stochastic Choice Of Basis Functions In Adaptive Function Approximation And The Functional-link Net - Comments', IEEE Transactions on Neural Networks, vol. 8, no. 2, pp. 452-454.
This paper includes some comments on and amendments to the above-mentioned paper. Subsequently, Theorem 1 in the above-mentioned paper has been revised. The significant change of the original theorem is the space of the thresholds in the hidden layer. The re
Chow, T.W.S. & Li, J.Y. 1997, 'Higher-order Petri net models based on artificial neural networks', Artificial Intelligence, vol. 92, no. 1-2, pp. 289-300.
In this paper, the properties of higher-order neural networks are exploited in a new class of Petri nets, called higher-order Petri nets (HOPN). Using the similarities between neural networks and Petri nets, this paper demonstrates how the McCulloch-Pitts models and the higher-order neural networks can be represented by Petri nets. A 5-tuple HOPN is defined, and a theorem on the relationship between the potential firability of the goal transition and the T-invariant (HOPN) is proved and discussed. The proposed HOPN can be applied to the polynomial clause subset of first-order predicate logic. A five-clause polynomial logic program example is also included to illustrate the theoretical results. © 1997 Elsevier Science B.V.
Li, J. & Chow, T. 1996, 'Function approximation of higher-order neural networks', Journal of Intelligent Systems, vol. 6, no. 3-4, pp. 239-260.