Dr. Jinyan Li is a Professor of Data Science at the Advanced Analytics Institute and a core member of the Centre for Health Technologies, Faculty of Engineering and IT, UTS. He is also the Bioinformatics Program leader. Jinyan has a Bachelor of Science degree (Applied Mathematics) from the National University of Defense Technology (China), a Master of Engineering degree (Computer Engineering) from Hebei University of Technology (China), and a PhD degree (Computer Science) from the University of Melbourne (Australia). He joined UTS in March 2011 after ten years of research and teaching work in Singapore (Institute for Infocomm Research, Nanyang Technological University, and National University of Singapore).
Jinyan is passionate about research on protein binding free energy prediction, conformational B-cell epitope prediction, PPIs, disease-RNA-gene tripartite, NGS data management, and RNA-seq data analysis. He also loves research on data mining algorithms and new machine learning methods. He has published 90 journal articles and 80 conference papers, many of which are highly cited. The journals he publishes in include: Machine Learning, Artificial Intelligence, Data Mining and Knowledge Discovery, IEEE TKDE, Bioinformatics, TCBB, BMC Genomics, BMC Bioinformatics, Nucleic Acids Research and Cancer Cell. The conferences include KDD, ICML, PODS, ICDT, ICDE, ICDM and SDM. Jinyan has 4 patents.
Jinyan is widely known for his pioneering and theoretical research work on Emerging Patterns that has spawned numerous follow-up research interests in data mining, machine learning, and bioinformatics and made an enduring contribution to these fields.
Associate Editor, BMC Bioinformatics
Academic Editor, PLoS ONE
Editorial board member, Information Systems
PC Co-chair, ADMA 2016
PC Co-chair, ICIC 2016
Workshop Co-chair, PAKDD 2016
Can supervise: YES
Data science, bioinformatics and computational biology, immunoinformatics, data mining, graph theory, machine learning, and information theory.
Introduction to bioinformatics, data mining and advanced data analysis.
Li, JY & Wong, LS 2003, Using rules to analyse bio-medical data: A comparison between C4.5 and PCL, Springer-Verlag Berlin.
The latest sequencing technologies such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines can generate long reads at the length of thousands of nucleic bases which is much longer than the reads at the length of hundreds generated by Illumina machines. However, these long reads are prone to much higher error rates, for example 15%, making downstream analysis and applications very difficult. Error correction is a process to improve the quality of sequencing data. Hybrid correction strategies have been recently proposed to combine Illumina reads of low error rates to fix sequencing errors in the noisy long reads with good performance. In this paper, we propose a new method named Bicolor, a bi-level framework of hybrid error correction for further improving the quality of PacBio long reads. At the first level, our method uses a de Bruijn graph-based error correction idea to search paths in pairs of solid k-mers iteratively with an increasing length of k-mer. At the second level, we combine the processed results under different parameters from the first level. In particular, a multiple sequence alignment algorithm is used to align those similar long reads, followed by a voting algorithm which determines the final base at each position of the reads. We compare the performance of Bicolor with three state-of-the-art methods on three real data sets. Results demonstrate that Bicolor always achieves the highest identity ratio. Bicolor also achieves a higher alignment ratio (> 1.3%) and a higher number of aligned reads than the current methods on two data sets. On the third data set, our method is closely competitive to the current methods in terms of number of aligned reads and genome coverage. The C++ source codes of our algorithm are freely available at https://github.com/yuansliu/Bicolor.
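The voting step mentioned in the abstract above can be illustrated with a minimal Python sketch. It assumes the reads are already aligned to equal length with `-` marking gaps, and it uses a plain column-wise majority vote; the function name `consensus_by_vote` is hypothetical and this is not the Bicolor implementation, whose actual code is in the C++ repository cited above.

```python
from collections import Counter


def consensus_by_vote(aligned_reads):
    """Column-wise majority vote over equal-length aligned reads.

    Assumption for illustration: '-' marks an alignment gap, and a
    column whose winner is a gap contributes no base to the consensus.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Most frequent symbol in this alignment column.
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


print(consensus_by_vote(["ACGT-A", "ACGTTA", "AAGT-A"]))
```

Ties are resolved by whichever symbol appears first in the column, which is an arbitrary choice made only to keep the sketch short.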
Zhang, X, Zhao, Z, Zheng, Y & Li, J 2020, 'Prediction of Taxi Destinations Using a Novel Data Embedding Method and Ensemble Learning', IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 1, pp. 68-78.
Liu, PY, Tee, AE, Milazzo, G, Hannan, KM, Maag, J, Mondal, S, Atmadibrata, B, Bartonicek, N, Peng, H, Ho, N, Mayoh, C, Ciaccio, R, Sun, Y, Henderson, MJ, Gao, J, Everaert, C, Hulme, AJ, Wong, M, Lan, Q, Cheung, BB, Shi, L, Wang, JY, Simon, T, Fischer, M, Zhang, XD, Marshall, GM, Norris, MD, Haber, M, Vandesompele, J, Li, J, Mestdagh, P, Hannan, RD, Dinger, ME, Perini, G & Liu, T 2019, 'The long noncoding RNA lncNB1 promotes tumorigenesis by interacting with ribosomal protein RPL35', Nature Communications, vol. 10, no. 1.
© 2019, The Author(s). The majority of patients with neuroblastoma due to MYCN oncogene amplification and consequent N-Myc oncoprotein over-expression die of the disease. Here our analyses of RNA sequencing data identify the long noncoding RNA lncNB1 as one of the transcripts most over-expressed in MYCN-amplified, compared with MYCN-non-amplified, human neuroblastoma cells and also the most over-expressed in neuroblastoma compared with all other cancers. lncNB1 binds to the ribosomal protein RPL35 to enhance E2F1 protein synthesis, leading to DEPDC1B gene transcription. The GTPase-activating protein DEPDC1B induces ERK protein phosphorylation and N-Myc protein stabilization. Importantly, lncNB1 knockdown abolishes neuroblastoma cell clonogenic capacity in vitro and leads to neuroblastoma tumor regression in mice, while high levels of lncNB1 and RPL35 in human neuroblastoma tissues predict poor patient prognosis. This study therefore identifies lncNB1 and its binding protein RPL35 as key factors for promoting E2F1 protein synthesis, N-Myc protein stability and N-Myc-driven oncogenesis, and as therapeutic targets.
Liu, Y, Yu, Y, Dinger, ME & Li, J 2019, 'Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression', Bioinformatics, vol. 35, no. 12, pp. 2066-2074.
Motivation:Advanced high-throughput sequencing technologies have produced massive amounts of reads data, and algorithms have been specially designed to contract the size of these data sets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has been proven to be one of the most effective reads compression approaches. As there is usually no good prior knowledge about the reference sequence, current focus is on the novel construction of de novo assembled contigs. Results:We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroup those that have the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer indexed suffix-prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 data sets (13 RNA-seq data sets and 5 whole genome sequencing data sets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved 20-80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order preserving mode. These excellent performances are mainly attributed to the exploitation of the redundancy of the repetitive substrings in the long contigs. Availability and Implementation:https://github.com/yuansliu/minicom. Supplementary Information:Supplementary data are available at Bioinformatics online.
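The first step described above, indexing reads by a minimizer and grouping reads that share one, can be sketched roughly as follows. This is a simplified illustration, not the minicom code: it assumes the minimizer of a read is its lexicographically smallest k-mer and ignores reverse complements, and the function names are hypothetical.

```python
from collections import defaultdict


def minimizer(read, k):
    """Return the lexicographically smallest k-mer of a read.

    Assumption: plain lexicographic order on the forward strand only.
    """
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def group_by_minimizer(reads, k):
    """Bucket reads that share the same minimizer."""
    groups = defaultdict(list)
    for read in reads:
        groups[minimizer(read, k)].append(read)
    return dict(groups)


groups = group_by_minimizer(["GATTACA", "TTACAGG", "CCCGGG"], k=3)
for key, members in sorted(groups.items()):
    print(key, members)
```

Reads that land in the same bucket are likely to overlap, which is what makes each bucket a reasonable starting point for building a contig.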
Liu, Y, Zhang, LY, Li, J & Hancock, J 2019, 'Fast detection of maximal exact matches via fixed sampling of query K-mers and Bloom filtering of index K-mers', Bioinformatics, vol. 35, no. 22, pp. 4560-4567.
© 2019 The Author(s). All rights reserved. Detection of maximal exact matches (MEMs) between two long sequences is a fundamental problem in pairwise reference-query genome comparisons. To efficiently compare larger and larger genomes, reducing the number of indexed k-mers as well as the number of query k-mers has been adopted as a mainstream approach which saves computational resources by avoiding a significant number of unnecessary matches. Results: Under this framework, we proposed a new method to detect all MEMs from a pair of genomes. The method first performs a fixed sampling of k-mers on the query sequence, and adds these selected k-mers to a Bloom filter. Then all the k-mers of the reference sequence are tested by the Bloom filter. If a k-mer passes the test, it is inserted into a hash table for indexing. Compared with the existing methods, far fewer query k-mers are generated and far fewer k-mers are inserted into the index to avoid unnecessary matches, leading to an efficient matching process and memory usage savings. Experiments on large genomes demonstrate that our method is at least 1.8 times faster than the best of the existing algorithms. This performance is mainly attributed to the key novelty of our method that the fixed k-mer sampling must be conducted on the query sequence and the index k-mers are filtered from the reference sequence via a Bloom filter.
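The sampling and filtering pipeline in the abstract above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `BloomFilter` class, its size, and its SHA-256 based hashing are arbitrary choices, `seed_matches` is a hypothetical name, and the sketch stops at exact k-mer seed matches without extending them to maximal exact matches.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Tiny Bloom filter; size and hash count are illustrative defaults."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def seed_matches(reference, query, k, step):
    """Return sorted (reference_pos, query_pos) pairs of exact k-mer hits."""
    # 1. Fixed sampling on the query: every `step`-th k-mer goes into the filter.
    bloom = BloomFilter()
    sampled = {}
    for q in range(0, len(query) - k + 1, step):
        sampled[q] = query[q:q + k]
        bloom.add(sampled[q])

    # 2. Only reference k-mers that pass the filter are indexed.
    index = defaultdict(list)
    for r in range(len(reference) - k + 1):
        kmer = reference[r:r + k]
        if kmer in bloom:
            index[kmer].append(r)

    # 3. Exact lookup of sampled query k-mers; Bloom false positives
    #    only cost index space and never produce a wrong hit.
    return sorted((r, q) for q, kmer in sampled.items()
                  for r in index.get(kmer, []))


print(seed_matches("ACGTACGTTTGACC", "GGACGTTTGA", k=4, step=2))
```

Because the final lookup goes through an exact hash table, a false positive in the filter can inflate the index slightly but cannot change the reported matches.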
Tang, T, Liu, Y, Zhang, B, Su, B & Li, J 2019, 'Sketch distance-based clustering of chromosomes for large genome database compression', BMC Genomics, vol. 20, no. Suppl 10, pp. 978-978.
BACKGROUND:The rapid development of Next-Generation Sequencing technologies enables sequencing genomes at low cost. The dramatically increasing amount of sequencing data has raised crucial needs for efficient compression algorithms. Reference-based compression algorithms have exhibited outstanding performance on compressing single genomes. However, for the more challenging and more useful problem of compressing a large collection of n genomes, straightforward application of these reference-based algorithms suffers a series of issues such as difficult reference selection and remarkable performance variation. RESULTS:We propose an efficient clustering-based reference selection algorithm for reference-based compression within separate clusters of the n genomes. This method clusters the genomes into subsets of highly similar genomes using MinHash sketch distance, and uses the centroid sequence of each cluster as the reference genome for an outstanding reference-based compression of the remaining genomes in each cluster. A final reference is then selected from these reference genomes for the compression of the remaining reference genomes. Our method significantly improved the performance of the state-of-the-art compression algorithms on large-scale human and rice genome databases containing thousands of genome sequences. The compression ratio gain can reach up to 20-30% in most cases for the datasets from NCBI, the 1000 Human Genomes Project and the 3000 Rice Genomes Project. The best improvement boosts the performance from 351.74 compression folds to 443.51 folds. CONCLUSIONS:The compression ratio of reference-based compression on large scale genome datasets can be improved via reference selection by applying appropriate data preprocessing and clustering methods. Our algorithm provides an efficient way to compress large genome databases.
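To make the idea of a MinHash sketch distance concrete, here is a small Python sketch. It is an assumption-laden illustration rather than the paper's method: it builds a bottom-s sketch from SHA-1 hashes of k-mers and reports one minus the estimated Jaccard similarity, and the function names and parameters are hypothetical.

```python
import hashlib


def kmer_hashes(seq, k):
    """Hash every k-mer of a sequence to a 64-bit integer."""
    return {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }


def sketch(seq, k, size):
    """Bottom-s sketch: the `size` smallest k-mer hashes."""
    return sorted(kmer_hashes(seq, k))[:size]


def sketch_distance(sketch_a, sketch_b, size):
    """1 - estimated Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    jaccard = sum(1 for h in merged if h in shared) / len(merged)
    return 1.0 - jaccard


a = sketch("ACGTACGTAC", k=3, size=8)
b = sketch("ACGTACGTTT", k=3, size=8)
print(sketch_distance(a, b, size=8))
```

A pairwise distance matrix computed this way could then be handed to any standard clustering routine; which clustering algorithm the authors use is not stated in the abstract.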
Zhang, B, Li, J, Quan, L, Chen, Y & Lü, Q 2019, 'Sequence-based prediction of protein-protein interaction sites by simplified long short-term memory network', Neurocomputing, vol. 357, pp. 86-100.
© 2019 Elsevier B.V. Proteins often interact with each other and form protein complexes to carry out various biochemical activities. Knowledge of the interaction sites is helpful for understanding disease mechanisms and drug design. Accurate prediction of the interaction sites from protein sequences is still a challenging task and severe imbalance data also decreased the performance of computational methods. In this study, we propose to use a deep learning method for improving the imbalanced prediction of protein interaction sites. We develop a new simplified long short-term memory (SLSTM) network to implement a deep learning architecture (named DLPred). To deal with the imbalanced classification in the deep learning model, we explore three new ideas. First, our collection of the training data is to construct a set of protein sequences, instead of a set of just single residues, to retain the entire sequential completeness of each protein. Second, a new penalization factor is appended to the loss function such that the penalization to the non-interaction site loss can be effectively enhanced. Third, multi-task learning of interaction sites and residue solvent accessibility prediction are used for correcting the preference of the prediction model on the non-interaction sites. Our model is evaluated on three public datasets: Dset186, Dtestset72 and PDBtestset164. Compared with current state-of-the-art methods, DLPred is able to significantly improve the predictive accuracies and AUC values while improving the F-measure. The training dataset, test datasets, a standalone version of DLPred and online service are available at http://qianglab.scst.suda.edu.cn/dlp/.
Zheng, Y, Peng, H, Ghosh, S, Lan, C & Li, J 2019, 'Inverse similarity and reliable negative samples for drug side-effect prediction', BMC Bioinformatics, vol. 19, no. Suppl 13, pp. 554-554.
BACKGROUND:In silico prediction of potential drug side-effects is of crucial importance for drug development, since wet experimental identification of drug side-effects is expensive and time-consuming. Existing computational methods mainly focus on leveraging validated drug side-effect relations for the prediction. The performance is severely impeded by the lack of reliable negative training data. Thus, a method to select reliable negative samples becomes vital in the performance improvement. METHODS:Most of the existing computational prediction methods are essentially based on the assumption that similar drugs are inclined to share the same side-effects, which has given rise to remarkable performance. It is also rational to assume an inverse proposition that dissimilar drugs are less likely to share the same side-effects. Based on this inverse similarity hypothesis, we proposed a novel method to select highly-reliable negative samples for side-effect prediction. The first step of our method is to build a drug similarity integration framework to measure the similarity between drugs from different perspectives. This step integrates drug chemical structures, drug target proteins, drug substituents, and drug therapeutic information as features into a unified framework. Then, a similarity score between each candidate negative drug and validated positive drugs is calculated using the similarity integration framework. Those candidate negative drugs with lower similarity scores are preferentially selected as negative samples. Finally, both the validated positive drugs and the selected highly-reliable negative samples are used for predictions. RESULTS:The performance of the proposed method was evaluated on simulative side-effect prediction of 917 DrugBank drugs, comparing with four machine-learning algorithms. Extensive experiments show that the drug similarity integration framework has superior capability in capturing drug features, achieving much better performance tha...
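The inverse similarity idea above, preferring candidate negatives that look least like the validated positives, can be illustrated with a short Python sketch. The scoring here is an assumption: it uses the mean cosine similarity over plain feature vectors in place of the paper's integrated similarity framework, and the drug names and vectors are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_reliable_negatives(candidates, positives, n):
    """Pick the n candidates least similar, on average, to the positives.

    candidates: dict mapping a (hypothetical) drug name to its feature vector.
    positives:  list of feature vectors of validated positive drugs.
    """
    scores = {
        name: sum(cosine(vec, p) for p in positives) / len(positives)
        for name, vec in candidates.items()
    }
    # Lowest mean similarity first; the name breaks ties deterministically.
    return sorted(scores, key=lambda name: (scores[name], name))[:n]


positives = [[1, 0, 1], [1, 1, 1]]
candidates = {"drugA": [1, 0, 1], "drugB": [0, 1, 0], "drugC": [0, 0, 1]}
print(select_reliable_negatives(candidates, positives, n=2))
```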
Lan, C, Peng, H, Hutvagner, G & Li, J 2019, 'Construction of competing endogenous RNA networks from paired RNA-seq data sets by pointwise mutual information', BMC Genomics, vol. 20, no. Suppl 9.
BACKGROUND:A long noncoding RNA (lncRNA) can act as a competing endogenous RNA (ceRNA) to compete with an mRNA for binding to the same miRNA. Such an interplay between the lncRNA, miRNA, and mRNA is called a ceRNA crosstalk. As an miRNA may have multiple lncRNA targets and multiple mRNA targets, connecting all the ceRNA crosstalks mediated by the same miRNA forms a ceRNA network. Methods have been developed to construct ceRNA networks in the literature. However, these methods have limits because they have not explored the expression characteristics of total RNAs. RESULTS:We proposed a novel method for constructing ceRNA networks and applied it to a paired RNA-seq data set. The first step of the method takes a competition regulation mechanism to derive candidate ceRNA crosstalks. Second, the method combines a competition rule and pointwise mutual information to compute a competition score for each candidate ceRNA crosstalk. Then, ceRNA crosstalks which have significant competition scores are selected to construct the ceRNA network. The key idea, pointwise mutual information, is ideally suitable for measuring the complex point-to-point relationships embedded in the ceRNA networks. CONCLUSION:Computational experiments and results demonstrate that the ceRNA networks can capture important regulatory mechanism of breast cancer, and have also revealed new insights into the treatment of breast cancer. The proposed method can be directly applied to other RNA-seq data sets for deeper disease understanding.
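For readers unfamiliar with pointwise mutual information, the quantity is pmi(x, y) = log2(p(x, y) / (p(x) p(y))). The Python sketch below computes it for two binary events observed over the same samples. How the paper turns expression values into events and combines PMI with its competition rule is not given in the abstract, so the binarized inputs and the function name `pmi` are assumptions for illustration only.

```python
import math


def pmi(events_x, events_y):
    """Pointwise mutual information (base 2) of two co-observed binary events.

    Each list holds 0/1 flags for the same samples, e.g. whether an RNA is
    highly expressed in that sample (a binarization assumed for this sketch).
    """
    n = len(events_x)
    if n == 0 or n != len(events_y):
        raise ValueError("inputs must be non-empty and of equal length")
    p_x = sum(events_x) / n
    p_y = sum(events_y) / n
    p_xy = sum(1 for x, y in zip(events_x, events_y) if x and y) / n
    if p_xy == 0:
        return float("-inf")  # the two events never co-occur
    return math.log2(p_xy / (p_x * p_y))


print(pmi([1, 1, 0, 0], [1, 1, 0, 0]))  # events always co-occur
print(pmi([1, 1, 0, 0], [1, 0, 1, 0]))  # events look independent
```

A positive value means the two events co-occur more often than independence would predict, and a value near zero means they look independent.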
Zhao, Z, Peng, H, Zhang, X, Zheng, Y, Chen, F, Fang, L & Li, J 2019, 'Identification of lung cancer gene markers through kernel maximum mean discrepancy and information entropy', BMC Medical Genomics, vol. 12, no. Suppl 8.
BACKGROUND:The early diagnosis of lung cancer has been a critical problem in clinical practice for a long time and identifying differentially expressed genes as disease markers is a promising solution. However, most existing gene differential expression analysis (DEA) methods have two main drawbacks: First, these methods are based on fixed statistical hypotheses and are not always effective; Second, these methods cannot identify a certain expression level boundary when there is no obvious expression level gap between control and experiment groups. METHODS:This paper proposed a novel approach to identify marker genes and gene expression level boundaries for lung cancer. By calculating a kernel maximum mean discrepancy, our method can evaluate the expression differences between normal, normal adjacent to tumor (NAT) and tumor samples. For the potential marker genes, the expression level boundaries among different groups are defined with the information entropy method. RESULTS:Compared with two conventional methods, t-test and fold change, the top average ranked genes selected by our method can achieve better performance under all metrics in the 10-fold cross-validation. Then GO and KEGG enrichment analysis are conducted to explore the biological function of the top 100 ranked genes. At last, we choose the top 10 average ranked genes as lung cancer markers and their expression boundaries are calculated and reported. CONCLUSION:The proposed approach is effective in identifying gene markers for lung cancer diagnosis. It is not only more accurate than conventional DEA methods but also provides a reliable method to identify the gene expression level boundaries.
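The kernel maximum mean discrepancy used above compares two samples through a kernel rather than through a fixed statistical test. A minimal Python sketch of the standard biased estimator of the squared MMD, for the expression values of a single gene in two groups, is given below. The Gaussian kernel, the bandwidth `sigma`, and the toy values are assumptions; the abstract does not state which kernel or estimator the authors use.

```python
import math


def rbf(a, b, sigma):
    """Gaussian (RBF) kernel between two scalar values."""
    return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two 1-D samples.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """
    k_xx = sum(rbf(a, b, sigma) for a in xs for b in xs) / (len(xs) ** 2)
    k_yy = sum(rbf(a, b, sigma) for a in ys for b in ys) / (len(ys) ** 2)
    k_xy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2 * k_xy


normal = [0.0, 0.1]        # toy expression values, not real data
tumor_far = [5.0, 5.1]
tumor_near = [0.2, 0.3]
print(mmd_squared(normal, tumor_far) > mmd_squared(normal, tumor_near))
```

Genes could then be ranked by this score, with larger values indicating a bigger distributional gap between groups.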
Zheng, Y, Peng, H, Zhang, X, Zhao, Z, Gao, X & Li, J 2019, 'DDI-PULearn: a positive-unlabeled learning method for large-scale prediction of drug-drug interactions', BMC Bioinformatics, vol. 20, no. Suppl 19.
BACKGROUND:Drug-drug interactions (DDIs) are a major concern in patients' medication. It is infeasible to identify all potential DDIs using experimental methods, which are time-consuming and expensive. Computational methods provide an effective strategy but face challenges due to the lack of experimentally verified negative samples. RESULTS:To address this problem, we propose a novel positive-unlabeled learning method named DDI-PULearn for large-scale drug-drug-interaction predictions. DDI-PULearn first generates seeds of reliable negatives via OCSVM (one-class support vector machine) under a high-recall constraint and via the cosine-similarity based KNN (k-nearest neighbors) as well. Then, trained with all the labeled positives (i.e., the validated DDIs) and the generated seed negatives, DDI-PULearn employs an iterative SVM to identify an entire set of reliable negatives from the unlabeled samples (i.e., the unobserved DDIs). Following that, DDI-PULearn represents all the labeled positives and the identified negatives as vectors of abundant drug properties by a similarity-based method. Finally, DDI-PULearn transforms these vectors into a lower-dimensional space via PCA (principal component analysis) and utilizes the compressed vectors as input for binary classifications. The performance of DDI-PULearn is evaluated on simulative prediction for 149,878 possible interactions between 548 drugs, comparing with two baseline methods and five state-of-the-art methods. Related experiment results show that the proposed method for the representation of DDIs characterizes them accurately. DDI-PULearn achieves superior performance owing to the identified reliable negatives, outperforming all other methods significantly. In addition, the predicted novel DDIs suggest that DDI-PULearn is capable of identifying novel DDIs. CONCLUSIONS:The results demonstrate that positive-unlabeled learning paves a new way to tackle the problem caused by the lack of experimentally verified nega...
Zheng, Y, Peng, H, Zhang, X, Zhao, Z, Gao, X & Li, J 2019, 'Old drug repositioning and new drug discovery through similarity learning from drug-target joint feature spaces', BMC Bioinformatics, vol. 20, no. 1.
Ho, N, Peng, H, Mayoh, C, Liu, PY, Atmadibrata, B, Marshall, GM, Li, J & Liu, T 2018, 'Delineation of the frequency and boundary of chromosomal copy number variations in paediatric neuroblastoma', Cell Cycle, vol. 17, no. 6, pp. 749-758.
Li, J, Nakai, K, Zheng, Y, Sato, K & Wong, L 2018, 'Introduction to Selected Papers from GIW2018', Journal of Bioinformatics and Computational Biology, vol. 16, no. 6.
Liu, Q, Chen, P, Wang, B, Zhang, J & Li, J 2018, 'Hot spot prediction in protein-protein interactions by an ensemble system', BMC Systems Biology, vol. 12, no. Suppl 9, pp. 132-132.
BACKGROUND:Hot spot residues are functional sites in protein interaction interfaces. The identification of hot spot residues is time-consuming and laborious using experimental methods. In order to address the issue, many computational methods have been developed to predict hot spot residues. Moreover, most prediction methods are based on structural features, sequence characteristics, and/or other protein features. RESULTS:This paper proposed an ensemble learning method to predict hot spot residues that only uses sequence features and the relative accessible surface area of amino acid sequences. In this work, a novel feature selection technique was developed, an auto-correlation function combined with a sliding window technique was applied to obtain the characteristics of amino acid residues in protein sequence, and an ensemble classifier with SVM and KNN base classifiers was built to achieve the best classification performance. CONCLUSION:The experimental results showed that our model yields the highest F1 score of 0.92 and an MCC value of 0.87 on ASEdb dataset. Compared with other machine learning methods, our model achieves a big improvement in hot spot prediction. AVAILABILITY:http://deeplearner.ahu.edu.cn/web/HotspotEL.htm .
Liu, Q, Ghosh, S, Li, J, Wong, L & Ramamohanarao, K 2018, 'Discovering pan-correlation patterns from time course data sets by efficient mining algorithms', Computing, vol. 100, no. 4, pp. 421-437.
© 2018, Springer-Verlag GmbH Austria, part of Springer Nature. Time-course correlation patterns can be positive or negative, and time-lagged with gaps. Mining all these correlation patterns helps to gain broad insights into variable dependencies. Here, we prove that diverse types of correlation patterns can be represented by a generalized form of positive correlation patterns. We prove a correspondence between positive correlation patterns and sequential patterns, and present an efficient single-scan algorithm for mining the correlations. Evaluations on synthetic time course data sets, and yeast cell cycle gene expression data sets indicate that: (1) the algorithm has linear time increment in terms of increasing number of variables; (2) negative correlation patterns are abundant in real-world data sets; and (3) correlation patterns with time lags and gaps are abundant. Existing methods have only discovered incomplete forms of many of these patterns, and have missed some important patterns completely.
Ma, Y, Yu, Z, Han, G, Li, J & Anh, V 2018, 'Identification of pre-microRNAs by characterizing their sequence order evolution information and secondary structure graphs', BMC Bioinformatics, vol. 19, no. Suppl 19, pp. 521-521.
BACKGROUND:Distinction between pre-microRNAs (precursor microRNAs) and length-similar pseudo pre-microRNAs can reveal more about the regulatory mechanism of RNA biological processes. Machine learning techniques have been widely applied to deal with this challenging problem. However, most of them mainly focus on secondary structure information of pre-microRNAs, while ignoring sequence-order information and sequence evolution information. RESULTS:We use new features for the machine learning algorithms to improve the classification performance by characterizing both sequence order evolution information and secondary structure graphs. We developed three steps to extract these features of pre-microRNAs. We first extract features from PSI-BLAST profiles and Hilbert-Huang transforms, which contain rich sequence evolution information and sequence-order information respectively. We then obtain properties of small molecular networks of pre-microRNAs, which contain refined secondary structure information. These structural features are carefully generated so that they can depict both global and local characteristics of pre-microRNAs. In total, our feature space covers 591 features. The maximum relevance and minimum redundancy (mRMR) feature selection method is adopted before support vector machine (SVM) is applied as our classifier. The constructed classification model is named MicroRNA -NHPred. The performance of MicroRNA -NHPred is high and stable, which is better than that of those state-of-the-art methods, achieving an accuracy of up to 94.83% on same benchmark datasets. CONCLUSIONS:The high prediction accuracy achieved by our proposed method is attributed to the design of a comprehensive feature set on the sequences and secondary structures, which are capable of characterizing the sequence evolution information and sequence-order information, and global and local information of pre-microRNAs secondary structures. MicroRNA -NHPred is a valuable method for pre-microRNAs i...
Peng, H, Zheng, Y, Blumenstein, M, Tao, D & Li, J 2018, 'CRISPR/Cas9 cleavage efficiency regression through boosting algorithms and Markov sequence profiling', Bioinformatics, vol. 34, no. 18, pp. 3069-3077.
BACKGROUND:Protein secondary structure can be regarded as an information bridge that links the primary sequence and tertiary structure. Accurate 8-state secondary structure prediction can give more precise, higher-resolution analysis of structure-based properties. RESULTS:We present a novel deep learning architecture which exploits an integrative synergy of prediction by a convolutional neural network, residual network, and bidirectional recurrent neural network to improve the performance of protein secondary structure prediction. A local block comprised of convolutional filters and original input is designed for capturing local sequence features. The subsequent bidirectional recurrent neural network consisting of gated recurrent units can capture global context features. Furthermore, the residual network can improve the information flow between the hidden layers and the cascaded recurrent neural network. Our proposed deep network achieved 71.4% accuracy on the benchmark CB513 dataset for the 8-state prediction; and the ensemble learning by our model achieved 74% accuracy. Our model generalization capability is also evaluated on three other independent datasets, CASP10, CASP11 and CASP12, for both 8- and 3-state prediction. These prediction performances are superior to the state-of-the-art methods. CONCLUSION:Our experiment demonstrates that it is a valuable method for predicting protein secondary structure, and capturing local and global features concurrently is very useful in deep learning.
Zhao, L, Wu, S, Jiang, J, Li, W, Luo, J & Li, J 2018, 'Novel overlapping subgraph clustering for the detection of antigen epitopes', Bioinformatics, vol. 34, no. 12, pp. 2061-2068.
Zhao, L, Xie, J, Bai, L, Chen, W, Wang, M, Zhang, Z, Wang, Y, Zhao, Z & Li, J 2018, 'Mining statistically-solid k-mers for accurate NGS error correction', BMC Genomics, vol. 19, no. Suppl 10, pp. 912-912.
BACKGROUND:NGS data contains many machine-induced errors. The most advanced methods for the error correction heavily depend on the selection of solid k-mers. A solid k-mer is a k-mer frequently occurring in NGS reads. The other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers, that are likely to contain no errors, into the remaining set of solid k-mers. Identification of these two subsets of k-mers can improve the correction performance. RESULTS:We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f0. To identify the two special subsets of k-mers, we use the z-score of k-mers which measures the number of standard deviations a k-mer's frequency is from the mean. Then these statistically-solid k-mers are used to construct a Bloom filter for error correction. Our method is markedly superior to the state-of-the-art methods, tested on both real and synthetic NGS data sets. CONCLUSION:The z-score is adequate to distinguish solid k-mers from weak k-mers, and is particularly useful for pinpointing solid k-mers having very low frequency. Applying the z-score to k-mers can markedly improve the error correction accuracy.
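The z-score described above measures how many standard deviations a k-mer's frequency lies from a mean. The Python sketch below shows the arithmetic on a toy read set. It makes a simplifying assumption that the abstract does not confirm: the mean and standard deviation are taken over the frequencies of all distinct k-mers in the data, whereas the paper may define the reference population differently. The function names are hypothetical.

```python
import statistics
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    return Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )


def kmer_z_scores(counts):
    """z-score of each k-mer's frequency against all k-mer frequencies.

    Assumption: population mean and standard deviation over all distinct
    k-mers, chosen here only to illustrate the calculation.
    """
    freqs = list(counts.values())
    mean = statistics.mean(freqs)
    std = statistics.pstdev(freqs)
    if std == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: (count - mean) / std for kmer, count in counts.items()}


counts = kmer_counts(["ACGT", "ACGT", "ACGA"], k=3)
for kmer, z in sorted(kmer_z_scores(counts).items()):
    print(kmer, counts[kmer], round(z, 3))
```

In this toy input `ACG` occurs three times and `CGA` once, so they receive positive and negative scores respectively, while `CGT` sits exactly at the mean.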
Lan, C, Peng, H, McGowan, EM, Hutvagner, G & Li, J 2018, 'An isomiR expression panel based novel breast cancer classification approach using improved mutual information', BMC Medical Genomics, vol. 11, no. Suppl 6, pp. 118-118.
BACKGROUND:Gene expression-based profiling has been used to identify biomarkers for different breast cancer subtypes. However, this technique has many limitations. IsomiRs are isoforms of miRNAs that have critical roles in many biological processes and have been successfully used to distinguish various cancer types. Biomarker isomiRs for identifying different breast cancer subtypes have not been investigated. For the first time, we aim to show that isomiRs are better performing biomarkers and use them to explain molecular differences between breast cancer subtypes. RESULTS:In this study, a novel method is proposed to identify specific isomiRs that faithfully classify breast cancer subtypes. First, as a null hypothesis method we removed the lowly expressed isomiRs from small sequencing data generated from diverse breast cancer types. Second, we developed an improved mutual information-based feature selection method to calculate the weight of each isomiR expression. The weight of an isomiR measures the importance of a given isomiR in classifying breast cancer subtypes. The improved mutual information can be applied to datasets in which the feature is continuous and the label is discrete, to which the traditional mutual information cannot be applied. Finally, the support vector machine (SVM) classifier is applied to find isomiR biomarkers for subtyping. CONCLUSIONS:Here we demonstrate that isomiRs can be used as biomarkers in the identification of different breast cancer subtypes, and in addition, they may provide new insights into the diverse molecular mechanisms of breast cancers. We have also shown that the classification of different subtypes of breast cancer based on isomiRs expression is more effective than using published gene expression profiling. The proposed method provides a better performance outcome than Fisher method and Hellinger method for discovering biomarkers to distinguish different breast cancer subtypes. This novel techniqu...
Zheng, Y, Peng, H, Zhang, X, Zhao, Z, Yin, J & Li, J 2018, 'Predicting adverse drug reactions of combined medication from heterogeneous pharmacologic databases.', BMC bioinformatics, vol. 19, no. Suppl 19, pp. 49-59.
BACKGROUND: Early and accurate identification of potential adverse drug reactions (ADRs) for combined medication is vital for public health. Existing methods either rely on expensive wet-lab experiments or detect existing associations from related records. Thus, they inevitably suffer from under-reporting, delays in reporting, and an inability to detect ADRs for new and rare drugs. The current application of machine learning methods is severely impeded by the lack of proper drug representation and credible negative samples. Therefore, a method to represent drugs properly and to select credible negative samples becomes vital in applying machine learning methods to this problem. RESULTS: In this work, we propose a machine learning method to predict ADRs of combined medication from pharmacologic databases by building up highly-credible negative samples (HCNS-ADR). Specifically, we first fuse heterogeneous information from different databases and represent each drug as a multi-dimensional vector according to its chemical substructures, target proteins, substituents, and related pathways. Then, a drug-pair vector is obtained by appending the vector of one drug to the other. Next, we construct a drug-disease-gene network and devise a scoring method to measure the interaction probability of every drug pair via network analysis. Drug pairs with lower interaction probability are preferentially selected as negative samples. Following that, the validated positive samples and the selected credible negative samples are projected into a lower-dimensional space using principal component analysis. Finally, a classifier is built for each ADR using its positive and negative samples with reduced dimensions. The performance of the proposed method is evaluated by simulated prediction for 1276 ADRs and 1048 drugs, using four machine learning algorithms and comparing against two baseline approaches. Extensive experiments show that the proposed way to represent drugs characterizes drugs accu...
Peng, H, Zheng, Y, Zhao, Z, Liu, T & Li, J 2018, 'Recognition of CRISPR/Cas9 off-target sites through ensemble learning of uneven mismatch distributions.', Bioinformatics, vol. 34, no. 17, pp. i757-i765.
Motivation: CRISPR/Cas9 is driving a broad range of innovative applications from basic biology to biotechnology and medicine. One of its current issues is the effect of off-target editing, which should be critically resolved and completely avoided in the ideal use of this system. Results: We developed an ensemble learning method to detect the off-target sites of a single guide RNA (sgRNA) from its thousands of genome-wide candidates. Nucleotide mismatches between on-target and off-target sites have been studied recently. We confirm that there exists strong mismatch enrichment and preferences at the 5'-end close regions of the off-target sequences. Compared with the on-target sites, sequences of no-editing sites can also be characterized by GC composition changes and position-specific mismatch binary features. Under this novel space of features, an ensemble strategy was applied to train a prediction model. The model achieved a mean Area Under the Receiver Operating Characteristic curve of 0.99 and a mean Area Under the Precision-Recall curve of 0.45 in cross-validations on big datasets, outperforming state-of-the-art methods in various test scenarios. Our predicted off-target sites also correspond very well to those detected by high-throughput sequencing techniques. In particular, two case studies for selecting sgRNAs to cure hearing loss and retinal degeneration partly prove the effectiveness of our method. Availability and implementation: The Python and MATLAB versions of the source code for detecting off-target sites of a given sgRNA and the supplementary files are freely available on the web at https://github.com/penn-hui/OfftargetPredict. Supplementary information: Supplementary data are available at Bioinformatics online.
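The feature space described in this abstract (position-specific mismatch binary features plus a GC composition change) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the equal-length assumption and the way the GC change is computed (candidate site minus sgRNA) are assumptions made here for illustration only.

```python
def mismatch_features(sgrna: str, site: str) -> list[int]:
    """One 0/1 flag per position: 1 where the candidate site differs from the sgRNA."""
    if len(sgrna) != len(site):
        raise ValueError("sgRNA and candidate site must have the same length")
    return [int(a != b) for a, b in zip(sgrna.upper(), site.upper())]


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def encode_site(sgrna: str, site: str) -> list[float]:
    """Hypothetical feature vector: mismatch flags followed by the GC change."""
    flags = mismatch_features(sgrna, site)
    gc_change = gc_content(site) - gc_content(sgrna)
    return [float(f) for f in flags] + [gc_change]


if __name__ == "__main__":
    # Toy 4-mer sequences; real sgRNAs are about 20 nt long.
    print(encode_site("AAGT", "ACGT"))
```

Vectors of this kind would then be passed to whatever ensemble of classifiers is chosen; the abstract does not specify the base learners, so none are shown here.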
Chen, Q, Lan, C, Chen, B, Wang, L, Li, J & Zhang, C 2017, 'Exploring Consensus RNA Substructural Patterns Using Subgraph Mining', IEEE-ACM Transactions on Computational Biology and Bioinformatics, vol. 14, no. 5, pp. 1134-1146.
Chen, Q, Wang, Y, Chen, B, Zhang, C, Wang, L & Li, J 2017, 'Using propensity scores to predict the kinases of unannotated phosphopeptides', Knowledge-Based Systems, vol. 135, pp. 60-76.
Ghosh, S, Li, J, Cao, L & Ramamohanarao, K 2017, 'Septic shock prediction for ICU patients via coupled HMM walking on sequential contrast patterns.', Journal of Biomedical Informatics, vol. 66, pp. 19-31.
BACKGROUND AND OBJECTIVE: Critical care patient events like sepsis or septic shock in intensive care units (ICUs) are dangerous complications which can cause multiple organ failures and eventual death. Preventive prediction of such events will allow clinicians to stage effective interventions for averting these critical complications. METHODS: It is widely understood that physiological variables of patients, such as blood pressure and heart rate, change gradually over a certain period of time prior to the occurrence of septic shock. This work investigates the performance of a novel machine learning approach for the early prediction of septic shock. The approach combines highly informative sequential patterns extracted from multiple physiological variables and captures the interactions among these patterns via coupled hidden Markov models (CHMM). In particular, the patterns are extracted from three non-invasive waveform measurements: the mean arterial pressure levels, the heart rates and the respiratory rates of septic shock patients from a large clinical ICU dataset called MIMIC-II. EVALUATION AND RESULTS: For baseline estimations, SVM and HMM models are employed on the continuous time series data for the given patients, using MAP (mean arterial pressure), HR (heart rate), and RR (respiratory rate). Single channel patterns based HMM (SCP-HMM) and multi-channel patterns based coupled HMM (MCP-HMM) are compared against baseline models using 5-fold cross validation accuracies over multiple rounds. In particular, the results of MCP-HMM are statistically significant, with a p-value of 0.0014, in comparison to baseline models. Our experiments demonstrate a strong competitive accuracy in the prediction of septic shock, especially when the interactions between the multiple variables are coupled by the learning model. CONCLUSIONS: It can be concluded that the novelty of the approach stems from the integration of sequence-based physiological pa...
Hasan, MAM, Li, J, Ahmad, S & Molla, MKI 2017, 'predCar-site: Carbonylation sites prediction in proteins using support vector machine with resolving data imbalanced issue.', Analytical biochemistry, vol. 525, pp. 107-113.
Carbonylation is an irreversible post-translational modification and is considered a biomarker of oxidative stress. It not only plays a major role in orchestrating various biological processes but is also associated with some diseases such as Alzheimer's disease, diabetes, and Parkinson's disease. However, since the experimental technologies for detecting carbonylation sites in proteins are costly and time-consuming, an accurate computational method for predicting carbonylation sites is urgently needed and could be useful for drug development. In this study, a novel computational tool termed predCar-Site has been developed to predict protein carbonylation sites by (1) incorporating the sequence-coupled information into the general pseudo amino acid composition, (2) balancing the effect of the skewed training dataset by the Different Error Costs method, and (3) constructing a predictor using a support vector machine as the classifier. The predCar-Site predictor achieves an average AUC (area under curve) score of 0.9959, 0.9999, 1, and 0.9997 in predicting the carbonylation sites of K, P, R, and T, respectively. All experimental results, including the AUC scores, are averages over 5 complete runs of 10-fold cross-validation, and they indicate significantly better performance than existing predictors. A user-friendly web server of predCar-Site is available at http://research.ru.ac.bd/predCar-Site/.
Hu, S-S, Chen, P, Wang, B & Li, J 2017, 'Protein binding hot spots prediction from sequence only by a new ensemble learning method.', Amino Acids, vol. 49, no. 10, pp. 1773-1785.
Hot spots are interfacial core areas of binding proteins, which have been applied as targets in drug design. Experimental methods for locating hot spot areas are costly in both time and expense. Recently, in silico computational methods have been widely used for hot spot prediction through sequence or structure characterization. As the structural information of proteins is not always solved, hot spot identification from amino acid sequences only is more useful for real-life applications. This work proposes a new sequence-based model that combines physicochemical features with the relative accessible surface area of amino acid sequences for hot spot prediction. The model consists of 83 classifiers involving the IBk (Instance-based k means) algorithm, where instances are encoded by important properties extracted from a total of 544 properties in the AAindex1 (Amino Acid Index) database. Then top-performance classifiers are selected to form an ensemble by a majority voting technique. The ensemble classifier outperforms the state-of-the-art computational methods, yielding an F1 score of 0.80 on the benchmark binding interface database (BID) test set. http://www2.ahu.edu.cn/pchen/web/HotspotEC.htm
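The final step of this abstract, selecting the top-performing classifiers and combining them by majority voting, can be sketched in a few lines of Python. The sketch assumes that each base classifier has already produced a list of labels and a validation score; the function names, the scores and the choice of k are hypothetical and are not taken from the paper.

```python
from collections import Counter


def select_top(scores: list[float], k: int) -> list[int]:
    """Indices of the k classifiers with the highest validation scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def majority_vote(predictions: list[list[int]]) -> list[int]:
    """predictions[i][j] is the label given by classifier i to sample j."""
    n_samples = len(predictions[0])
    voted = []
    for j in range(n_samples):
        counts = Counter(clf[j] for clf in predictions)
        # With an odd number of binary voters there are no ties.
        voted.append(counts.most_common(1)[0][0])
    return voted


def ensemble_predict(predictions: list[list[int]], scores: list[float], k: int) -> list[int]:
    """Majority vote over the k best-scoring classifiers."""
    top = select_top(scores, k)
    return majority_vote([predictions[i] for i in top])


if __name__ == "__main__":
    preds = [[1, 0, 1], [1, 1, 0], [0, 0, 0], [0, 0, 1]]  # toy outputs of 4 classifiers
    scores = [0.7, 0.9, 0.6, 0.8]                          # toy validation scores
    print(ensemble_predict(preds, scores, k=3))
```

Choosing an odd k keeps binary votes free of ties; how the paper breaks ties or chooses the ensemble size is not stated in the abstract.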
Li, J, Fong, S, Wong, RK, Millham, R & Wong, KKL 2017, 'Elitist Binary Wolf Search Algorithm for Heuristic Feature Selection in High-Dimensional Bioinformatics Datasets', Scientific Reports, vol. 7.
Li, J, Liu, L-S, Fong, S, Wong, RK, Mohammed, S, Fiaidhi, J, Sung, Y & Wong, KKL 2017, 'Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data', PLOS ONE, vol. 12, no. 7.
The rapidly increasing number of genomes generated by high-throughput sequencing platforms and assembly algorithms is accompanied by problems in data storage, compression and communication. Traditional compression algorithms are unable to meet the demand for a high compression ratio due to the intrinsically challenging features of DNA sequences such as small alphabet size, frequent repeats and palindromes. Reference-based lossless compression, by which only the differences between two similar genomes are stored, is a promising approach with a high compression ratio. We present a high-performance referential genome compression algorithm named HiRGC. It is based on a 2-bit encoding scheme and an advanced greedy-matching search on a hash table. We compare the performance of HiRGC with four state-of-the-art compression methods on a benchmark dataset of eight human genomes. HiRGC takes <30 min to compress about 21 gigabytes of each set of the seven target genomes into 96-260 megabytes, achieving compression ratios of 217 to 82 times. This performance is at least 1.9 times better than the best competing algorithm on its best case. Our compression speed is also at least 2.9 times faster. HiRGC is stable and robust in dealing with different reference genomes. In contrast, the competing methods' performance varies widely on different reference genomes. More experiments on 100 human genomes from the 1000 Genome Project and on genomes of several other species again demonstrate that HiRGC's performance is consistently excellent. The C++ and Java source codes of our algorithm are freely available for academic and non-commercial use. They can be downloaded from https://github.com/yuansliu/HiRGC. Supplementary data are available at Bioinformatics online.
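The two ingredients named in this abstract, a 2-bit encoding of nucleotides and a hash table over the reference, can be illustrated with the following Python sketch. It is a simplified illustration under stated assumptions, not the HiRGC implementation: it handles only the letters A, C, G and T, the helper names are invented here, and the greedy matching step itself is not reproduced.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # 2 bits per base
BASES = "ACGT"


def pack(seq: str) -> tuple[int, int]:
    """Pack a DNA string into an integer, 2 bits per base; also return its length."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value, len(seq)


def unpack(value: int, length: int) -> str:
    """Inverse of pack(): recover the DNA string from the packed integer."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def kmer_index(reference: str, k: int) -> dict[int, list[int]]:
    """Hash table from packed k-mer to all of its start positions in the reference."""
    index: dict[int, list[int]] = {}
    for i in range(len(reference) - k + 1):
        key, _ = pack(reference[i:i + k])
        index.setdefault(key, []).append(i)
    return index


if __name__ == "__main__":
    packed, n = pack("ACGT")
    print(packed, unpack(packed, n))
    print(kmer_index("ACGAC", 2))
```

A referential compressor would look up each k-mer of the target genome in such an index and extend the hit as far as the two sequences agree, storing only positions, lengths and mismatched bases.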
Mao, R, Liang, C, Zhang, Y, Hao, X & Li, J 2017, '50/50 Expressional Odds of Retention Signifies the Distinction between Retained Introns and Constitutively Spliced Introns in Arabidopsis thaliana.', Frontiers in Plant Science, vol. 8, pp. 1-16.
Intron retention, one of the most prevalent alternative splicing events in plants, can lead to introns retained in mature mRNAs. However, the features that distinguish retained introns (RIs) from constitutively spliced introns (CSIs) are still poorly understood. This work proposes a computational pipeline to discover novel RIs from multiple next-generation RNA sequencing (RNA-Seq) datasets of Arabidopsis thaliana. Using this pipeline, we detected 3,472 novel RIs from 18 RNA-Seq datasets and re-confirmed 1,384 RIs which are currently annotated in the TAIR10 database. We also use the expression of intron-containing isoforms as a new feature in addition to the conventional features. Based on these features, RIs are highly distinguishable from CSIs by machine learning methods, especially when the expressional odds of retention (i.e., the expression ratio of the RI-containing isoforms relative to the isoforms without RIs for the same gene) reaches or exceeds 50/50. In this case, the RIs and CSIs can be clearly separated by the Random Forest with an outstanding performance of 0.95 on AUC (the area under a receiver operating characteristics curve). The characteristics most closely related to the RIs include the low strength of splice sites, high similarity with the flanking exon sequences, low occurrence percentage of YTRAY near the acceptor site, existence of putative intronic splicing silencers (ISSs, i.e., AG/GA-rich motifs) and intronic splicing enhancers (ISEs, i.e., TTTT-containing motifs), and enrichment of Serine/Arginine-Rich (SR) proteins and heterogeneous nuclear ribonucleoparticle proteins (hnRNPs).
Peng, H, Lan, C, Liu, Y, Liu, T, Blumenstein, M & Li, J 2017, 'Chromosome preference of disease genes and vectorization for the prediction of non-coding disease genes.', Oncotarget, vol. 8, no. 45, pp. 78901-78916.
Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector to represent diseases, and applies the newly vectorized data to a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. This novel vector representation for diseases consists of two sub-vectors. The first is composed of 45 elements, characterizing the information entropies of the disease gene distribution over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein coding genes, while some (e.g., the 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional, characterizing the distribution of disease gene enriched KEGG pathways in comparison with our manually created pathway groups. The second sub-vector complements the first one to differentiate between various diseases. Our prediction method outperforms the state-of-the-art methods on benchmark datasets for prioritizing disease-related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes.
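As a rough illustration of the first sub-vector, the Python sketch below turns per-substructure gene counts for one disease into per-element entropy terms. The abstract does not give the exact formula, so the choice of -p*log2(p) per substructure, the function name and the toy counts are all assumptions made for illustration.

```python
import math


def entropy_terms(counts: list[int]) -> list[float]:
    """Per-substructure entropy contribution -p*log2(p), with p the share of disease genes."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    terms = []
    for c in counts:
        p = c / total
        terms.append(-p * math.log2(p) if p > 0 else 0.0)
    return terms


if __name__ == "__main__":
    # Toy example with 3 substructures; the paper uses 45.
    terms = entropy_terms([2, 2, 0])
    print(terms, sum(terms))
```

Summing the terms gives the Shannon entropy of the distribution, which is low when a disease's genes concentrate on a few substructures and high when they are spread evenly.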
Zhao, L, Chen, Q, Li, W, Jiang, P, Wong, L & Li, J 2017, 'MapReduce for accurate error correction of next-generation sequencing data.', Bioinformatics, vol. 33, no. 23, pp. 3844-3851.
Next-generation sequencing platforms have produced huge amounts of sequence data. This is revolutionizing every aspect of genetic and genomic research. However, these sequence datasets contain quite a number of machine-induced errors; for example, errors due to substitution can be as high as 2.5%. Existing error-correction methods are still far from perfect. In fact, more errors are sometimes introduced than correct corrections, especially by the prevalent k-mer based methods. The existing methods have also made limited exploitation of on-demand cloud computing. We introduce an error-correction method named MEC, which uses a two-layered MapReduce technique to achieve high correction performance. In the first layer, all the input sequences are mapped to groups to identify candidate erroneous bases in parallel. In the second layer, the erroneous bases at the same position are linked together from all the groups for making statistically reliable corrections. Experiments on real and simulated datasets show that our method outperforms existing methods remarkably. Its per-position error rate is consistently the lowest, and the correction gain is always the highest. The source code is available at email@example.com or firstname.lastname@example.org. Supplementary data are available at Bioinformatics online.
Peng, H, Lan, C, Zheng, Y, Hutvagner, G, Tao, D & Li, J 2017, 'Cross disease analysis of co-functional microRNA pairs on a reconstructed network of disease-gene-microRNA tripartite.', BMC Bioinformatics, vol. 18, pp. 1-17.
MicroRNAs always function cooperatively in their regulation of gene expression. Dysfunctions of these co-functional microRNAs can play significant roles in disease development. We are interested in those multi-disease associated co-functional microRNAs that regulate their common dysfunctional target genes cooperatively in the development of multiple diseases. The research is potentially useful for human disease studies at the transcriptional level and for the study of multi-purpose microRNA therapeutics. We designed a computational method to detect multi-disease associated co-functional microRNA pairs and conducted cross disease analysis on a reconstructed disease-gene-microRNA (DGR) tripartite network. The DGR tripartite network is constructed by integrating newly predicted disease-microRNA associations with the relationships of diseases, microRNAs and genes maintained by existing databases. The prediction method uses a set of reliable negative samples of disease-microRNA association and a pre-computed kernel matrix instead of kernel functions. From this reconstructed DGR tripartite network, multi-disease associated co-functional microRNA pairs are detected together with their common dysfunctional target genes and ranked by a novel scoring method. We also conducted proof-of-concept case studies on cancer-related co-functional microRNA pairs as well as on non-cancer disease-related microRNA pairs. With the prioritization of the co-functional microRNAs that relate to a series of diseases, we found that the co-function phenomenon is not unusual. We also confirmed that the regulation of the microRNAs for the development of cancers is more complex and has more unique properties than that of non-cancer diseases.
Chen, P, Hu, S, Zhang, J, Gao, X, Li, J, Xia, J & Wang, B 2016, 'A Sequence-Based Dynamic Ensemble Learning System for Protein Ligand-Binding Site Prediction', IEEE-ACM Transactions on Computational Biology and Bioinformatics, vol. 13, no. 5, pp. 901-912.
Ghosh, S, Feng, M, Nguyen, H & Li, J 2016, 'Hypotension Risk Prediction via Sequential Contrast Patterns of ICU Blood Pressure', IEEE Journal of Biomedical and Health Informatics, vol. 20, no. 5, pp. 1416-1426.
© 2013 IEEE. Acute hypotension is a significant risk factor for in-hospital mortality at intensive care units. Prolonged hypotension can cause tissue hypoperfusion, leading to cellular dysfunction and severe injuries to multiple organs. Prompt medical interventions are thus extremely important for dealing with acute hypotensive episodes (AHE). Population level prognostic scoring systems for risk stratification of patients are suboptimal in such scenarios. However, the design of an efficient risk prediction system can significantly help in the identification of critical care patients, who are at risk of developing an AHE within a future time span. Toward this objective, a pattern mining algorithm is employed to extract informative sequential contrast patterns from hemodynamic data, for the prediction of hypotensive episodes. The hypotensive and normotensive patient groups are extracted from the MIMIC-II critical care research database, following an appropriate clinical inclusion criteria. The proposed method consists of a data preprocessing step to convert the blood pressure time series into symbolic sequences, using a symbolic aggregate approximation algorithm. Then, distinguishing subsequences are identified using the sequential contrast mining algorithm. These subsequences are used to predict the occurrence of an AHE in a future time window separated by a user-defined gap interval. Results indicate that the method performs well in terms of the prediction performance as well as in the generation of sequential patterns of clinical significance. Hence, the novelty of sequential patterns is in their usefulness as potential physiological biomarkers for building optimal patient risk stratification systems and for further clinical investigation of interesting patterns in critical care patients.
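The preprocessing step mentioned in this abstract, converting a blood pressure time series into a symbolic sequence with symbolic aggregate approximation (SAX), can be sketched as follows. This is a generic textbook-style SAX in Python, not the authors' code: the three-letter alphabet, the segment count and the requirement that the series length be divisible by the number of segments are simplifying assumptions.

```python
from statistics import NormalDist, mean, pstdev


def sax(series: list[float], n_segments: int, alphabet: str = "abc") -> str:
    """Symbolic aggregate approximation of a numeric series."""
    if len(series) % n_segments:
        raise ValueError("series length must be divisible by n_segments in this sketch")
    # 1. z-normalise the series (a constant series maps to all zeros).
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    # 2. Piecewise aggregate approximation: mean of each equal-width segment.
    width = len(z) // n_segments
    paa = [mean(z[i * width:(i + 1) * width]) for i in range(n_segments)]
    # 3. Breakpoints that cut the standard normal into equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    # 4. Map each segment mean to the letter of the region it falls in.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)


if __name__ == "__main__":
    # Toy mean arterial pressure readings (mmHg).
    print(sax([60, 62, 75, 77, 90, 92], n_segments=3))
```

The resulting strings are what a sequential contrast mining algorithm would then compare between the hypotensive and normotensive groups.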
Lan, C, Chen, Q & Li, J 2016, 'Grouping miRNAs of similar functions via weighted information content of gene ontology.', BMC Bioinformatics, vol. 17, no. Suppl 19, pp. 159-295.
BACKGROUND: Regulation mechanisms between miRNAs and genes are complicated. To accomplish a biological function, a miRNA may regulate multiple target genes, and similarly a target gene may be regulated by multiple miRNAs. Wet-lab knowledge of co-regulating miRNAs is limited. This work introduces a computational method to group miRNAs of similar functions to identify co-regulating miRNAs from a similarity matrix of miRNAs. RESULTS: We define a novel information content of gene ontology (GO) to measure similarity between two sets of GO graphs corresponding to the two sets of target genes of two miRNAs. This between-graph similarity is then transferred as a functional similarity between the two miRNAs. Our definition of the information content is based on the size of a GO term's descendants, but adjusted by a weight derived from its depth level and the GO relationships on its path to the root node or to the most informative common ancestor (MICA). Further, a self-tuning technique and the eigenvalues of the normalized Laplacian matrix are applied to determine the optimal parameters for the spectral clustering of the similarity matrix of the miRNAs. CONCLUSIONS: Experimental results demonstrate that our method has better clustering performance than the existing edge-based, node-based or hybrid methods. Our method has also demonstrated a novel usefulness for the function annotation of new miRNAs, as reported in the detailed case studies.
Li, J, Fong, S, Mohammed, S & Fiaidhi, J 2016, 'Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms', Journal of Supercomputing, vol. 72, no. 10, pp. 3708-3728.
Li, J, Fong, S, Mohammed, S, Fiaidhi, J, Chen, Q & Tan, Z 2016, 'Solving the Under-Fitting Problem for Decision Tree Algorithms by Incremental Swarm Optimization in Rare-Event Healthcare Classification', Journal of Medical Imaging and Health Informatics, vol. 6, no. 4, pp. 1102-1110.
Li, J, Fong, S, Siu, S, Mohammed, S, Fiaidhi, J & Wong, KKL 2016, 'Improving classification of protein binders for virtual drug screening by novel swarm-based feature selection techniques', Computerized Medical Imaging and Graphics.
© 2016 Elsevier Ltd. Drug design involves classification of protein binding, which is usually done in a computer simulation prior to extensive actual tests. Accurate classification of protein binding is essential, but it is obstructed by the very challenging task of feature selection (FS) because there are too many potential features. Dorothea, as a case of virtual screening in drug design, has 100,000 features that inflate to a very large (2^100,000 possible candidate feature subsets to be selected) but very sparse search space. In this paper, this computational challenge is tackled by a new model of feature selection called Two-stage Swarm Search-FS (TSS-FS). The novelty of TSS-FS is the use of an adaptive search space shrinking mechanism, which is the first stage of TSS-FS, to reduce computing cost and increase classification accuracy. Reducing the very large and sparse search space enables the swarm feature selection to operate more efficiently. Results demonstrated in the paper confirm the efficacy of the new algorithms.
Li, J, Fong, S, Sung, Y, Cho, K, Wong, R & Wong, KKL 2016, 'Adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique algorithm for tackling binary imbalanced datasets in biomedical data classification', BioData Mining, vol. 9.
Liu, Q, Song, J & Li, J 2016, 'Using contrast patterns between true complexes and random subgraphs in PPI networks to predict unknown protein complexes.', Scientific Reports, vol. 6, pp. 1-15.
Most protein complex detection methods utilize unsupervised techniques to cluster densely connected nodes in a protein-protein interaction (PPI) network, in spite of the fact that many true complexes are not dense subgraphs. Supervised methods have been proposed recently, but they do not answer why a group of proteins are predicted as a complex, and they have not investigated how to detect new complexes of one species by training the model on the PPI data of another species. We propose a novel supervised method to address these issues. The key idea is to discover emerging patterns (EPs), a type of contrast pattern, which can clearly distinguish true complexes from random subgraphs in a PPI network. An integrative score of EPs is defined to measure how likely a subgraph of proteins can form a complex. New complexes thus can grow from our seed proteins by iteratively updating this score. The performance of our method is tested on eight benchmark PPI datasets and compared with seven unsupervised methods, two supervised and one semi-supervised methods under five standards to assess the quality of the predicted complexes. The results show that in most cases our method achieved a better performance, sometimes significantly.
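For readers unfamiliar with emerging patterns, the Python sketch below computes the standard growth rate of an itemset between two classes: its support in the target class divided by its support in the background class. The feature names and toy data are hypothetical, and the integrative score defined in the paper is not reproduced here.

```python
def support(pattern: frozenset, dataset: list[set]) -> float:
    """Fraction of rows in the dataset that contain every item of the pattern."""
    return sum(pattern <= row for row in dataset) / len(dataset)


def growth_rate(pattern: frozenset, background: list[set], target: list[set]) -> float:
    """Support in the target class divided by support in the background class."""
    s_bg = support(pattern, background)
    s_t = support(pattern, target)
    if s_bg == 0:
        return 0.0 if s_t == 0 else float("inf")
    return s_t / s_bg


if __name__ == "__main__":
    # Hypothetical discretised subgraph features; "a", "b", "c" are placeholder items.
    random_subgraphs = [{"a", "b"}, {"a"}, {"b"}, {"c"}]
    true_complexes = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
    print(growth_rate(frozenset({"a", "b"}), random_subgraphs, true_complexes))
```

A pattern is called an emerging pattern when its growth rate is at least a chosen threshold; patterns with infinite growth rate occur in one class only.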
Tee, AE, Liu, B, Song, R, Li, J, Pasquier, E, Cheung, BB, Jiang, C, Marshall, GM, Haber, M, Norris, MD, Fletcher, JI, Dinger, ME & Liu, T 2016, 'The long noncoding RNA MALAT1 promotes tumor-driven angiogenesis by up-regulating pro-angiogenic gene expression.', Oncotarget, vol. 7, no. 8, pp. 8663-8675.
Neuroblastoma is the most common solid tumor during early childhood. One of the key features of neuroblastoma is extensive tumor-driven angiogenesis due to hypoxia. However, the mechanism through which neuroblastoma cells drive angiogenesis is poorly understood. Here we show that the long noncoding RNA MALAT1 was upregulated in human neuroblastoma cell lines under hypoxic conditions. Conditioned media from neuroblastoma cells transfected with small interfering RNAs (siRNA) targeting MALAT1, compared with conditioned media from neuroblastoma cells transfected with control siRNAs, induced significantly less endothelial cell migration, invasion and vasculature formation. Microarray-based differential gene expression analysis showed that one of the genes most significantly down-regulated following MALAT1 suppression in human neuroblastoma cells under hypoxic conditions was fibroblast growth factor 2 (FGF2). RT-PCR and immunoblot analyses confirmed that MALAT1 suppression reduced FGF2 expression, and Enzyme-Linked Immunosorbent Assays revealed that transfection with MALAT1 siRNAs reduced FGF2 protein secretion from neuroblastoma cells. Importantly, addition of recombinant FGF2 protein to the cell culture media reversed the effects of MALAT1 siRNA on vasculature formation. Taken together, our data suggest that up-regulation of MALAT1 expression in human neuroblastoma cells under hypoxic conditions increases FGF2 expression and promotes vasculature formation, and therefore plays an important role in tumor-driven angiogenesis.
Wang, C, Dong, X, Han, L, Su, X-D, Zhang, Z, Li, J & Song, J 2016, 'Identification of WD40 repeats by secondary structure-aided profile-profile alignment.', Journal of theoretical biology, vol. 398, pp. 122-129.
A WD40 protein typically contains four or more repeats of ~40 residues ended with the Trp-Asp dipeptide, which folds into β-propellers with four β strands in each repeat. They often function as scaffolds for protein-protein interactions and are involved in numerous fundamental biological processes. Despite their important functional role, the "velcro" closure of WD40 propellers and the diversity of WD40 repeats make their identification a difficult task. Here we develop a new WD40 Repeat Recognition method (WDRR), which uses predicted secondary structure information to generate candidate repeat segments, and further employs a profile-profile alignment to identify the correct WD40 repeats from candidate segments. In particular, we design a novel alignment scoring function that combines dot product and BLOSUM62, thereby achieving a great balance of sensitivity and accuracy. Taking advantage of these strategies, WDRR could effectively reduce the false positive rate and accurately identify more remote homologous WD40 repeats with precise repeat boundaries. We further use WDRR to re-annotate the Pfam families in the β-propeller clan (CL0186) and identify a number of WD40 repeat proteins with high confidence across nine model organisms. The WDRR web server and the datasets are available at http://protein.cau.edu.cn/wdrr/.
Wang, C, Fang, Y & Li, JY 2016, 'Estimation of NON-WSSUS channel for OFDM system: Exploiting support correlations through a novel adaptive weighted predict-re-estimate L1 minimization approach', Journal of Communications, vol. 11, no. 2, pp. 149-156.
© 2016 Journal of Communications. It is challenging to estimate the wireless channel of the Orthogonal Frequency-Division Multiplexing (OFDM) broadband system under a changing communication environment. The difficulty is mainly attributed to this wireless channel’s Non Wide Sense Stationary Uncorrelated Scattering (Non-WSSUS) property, which implies that the delay and Doppler shift of such a channel are non-stationary and correlated. A Non-WSSUS channel is very different from the classical time-varying channel with constant delay and Doppler shift. In this paper, we propose an estimation method for the Non-WSSUS Channel Impulse Response (CIR) of the OFDM system. Based on the sparsity property of the delay-Doppler spread function, the delay and Doppler shift of the Non-WSSUS channel can be extracted through a Compressive Sensing (CS) approach. Then a novel CS algorithm referred to as Pre-Re L1 is proposed. The proposed CS algorithm exploits the correlations of the sparse supports to obtain adaptive weights for L1 minimization. Numerical simulation results show that the proposed CS method improves the performance of the Non-WSSUS wireless channel estimation.
Wang, X, Yan, R, Li, J & Song, J 2016, 'SOHPRED: a new bioinformatics tool for the characterization and prediction of human S-sulfenylation sites', Molecular BioSystems, vol. 12, no. 9, pp. 2849-2858.
Zheng, Y, Ji, B, Song, R, Wang, S, Li, T, Zhang, X, Chen, K, Li, T & Li, J 2016, 'Accurate detection for a wide range of mutation and editing sites of microRNAs from small RNA high-throughput sequencing profiles', Nucleic Acids Research, vol. 44, no. 14.
Ghosh, S & Li, J 2015, 'Using sequential patterns as features for classification models to make accurate predictions on ICU events', Conference proceedings : ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference, vol. 2015, pp. 8157-8160.
Pattern mining algorithms have previously been utilized to extract informative rules in various clinical contexts. However, the generated patterns are numerous. In most cases, the extracted rules are directly investigated by clinicians for understanding disease diagnoses. The elicitation of important patterns for clinical investigation places a significant demand on precision and interpretability. Hence, it is essential to obtain a set of informative, interpretable patterns for building advanced learning models about a patient's physiological condition, especially in critical care units. In this study, a two-stage classification framework based on sequential contrast patterns is presented, which is used to detect critical patient events such as hypotension. In the first stage, we obtain a set of sequential patterns by using a contrast mining algorithm. In the second stage, these sequential patterns are post-processed into binary-valued and frequency-based features for developing a classification model. Our results on eight critical care datasets demonstrate better predictive capabilities when sequential patterns are used as features.
Hasan, MM, Zhou, Y, Lu, X, Li, J, Song, J & Zhang, Z 2015, 'Computational Identification of Protein Pupylation Sites by Using Profile-Based Composition of k-Spaced Amino Acid Pairs', PLoS ONE, vol. 10, no. 6.
Prokaryotic proteins are regulated by pupylation, a type of post-translational modification that contributes to cellular function in bacterial organisms. In the pupylation process, prokaryotic ubiquitin-like protein (Pup) tagging is functionally analogous to ubiquitination in that it tags target proteins for proteasomal degradation. To date, several experimental methods have been developed to identify pupylated proteins and their pupylation sites, but these experimental methods are generally laborious and costly. Therefore, computational methods that can accurately predict potential pupylation sites based on protein sequence information are highly desirable. In this paper, a novel predictor termed pbPUP has been developed for accurate prediction of pupylation sites. In particular, a sophisticated sequence encoding scheme [i.e. the profile-based composition of k-spaced amino acid pairs (pbCKSAAP)] is used to represent the sequence patterns and evolutionary information of the sequence fragments surrounding pupylation sites. Then, a Support Vector Machine (SVM) classifier is trained using the pbCKSAAP encoding scheme. The final pbPUP predictor achieves an AUC value of 0.849 in 10-fold cross-validation tests and outperforms other existing predictors on a comprehensive independent test dataset. The proposed method is anticipated to be a helpful computational resource for the prediction of pupylation sites. The web server and curated datasets in this study are freely available at http://protein.cau.edu.cn/pbPUP/.
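To make the encoding idea concrete, here is a minimal sketch of the plain (sequence-only) composition of k-spaced amino acid pairs. It is not the profile-based pbCKSAAP variant used by pbPUP, which additionally draws on evolutionary profiles; the function name `cksaap` and the default `k_max=2` are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def cksaap(fragment, k_max=2):
    """Return the composition of k-spaced residue pairs for k = 0..k_max.

    For each gap k, every pair (fragment[i], fragment[i + k + 1]) is counted
    and the counts are divided by the number of such pairs in the fragment.
    The result has 400 * (k_max + 1) entries.
    """
    features = []
    for k in range(k_max + 1):
        n_pairs = len(fragment) - k - 1
        counts = dict.fromkeys(PAIRS, 0)
        for i in range(n_pairs):
            pair = fragment[i] + fragment[i + k + 1]
            if pair in counts:  # skip non-standard residues such as 'X'
                counts[pair] += 1
        features.extend(
            counts[pair] / n_pairs if n_pairs > 0 else 0.0 for pair in PAIRS
        )
    return features


if __name__ == "__main__":
    vector = cksaap("ACACA", k_max=1)
    print(len(vector), vector[PAIRS.index("AC")])
```

A feature vector of this kind, computed for the fragment around each candidate site, is the sort of input an SVM classifier could be trained on.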
Li, Z, He, Y, Wong, L & Li, J 2015, 'Burial Level Change Defines a High Energetic Relevance for Protein Binding Interfaces', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 2, pp. 410-421.
Protein-protein interfaces defined through atomic contact or solvent accessibility change are widely adopted in structural biology studies. However, these definitions cannot precisely capture energetically important regions at protein interfaces. The burial depth of an atom in a protein is related to the atom's energy. This work investigates how closely the change in burial level of an atom/residue upon complexation is related to the binding. Burial level change is different from burial level itself. An atom deeply buried in a monomer with a high burial level may not change its burial level after an interaction, and so it may have little burial level change. We hypothesize that an interface is a region of residues all undergoing burial level changes after interaction. By this definition, an interface can be decomposed into an onion-like structure according to the extent of burial level change. We found that our defined interfaces cover energetically important residues more precisely, and that the binding free energy of an interface is distributed progressively from the outermost layer to the core. These observations are used to predict binding hot spots. Our approach's F-measure performance on a benchmark dataset of alanine mutagenesis residues is superior or similar to that of complicated energy modeling or machine learning approaches.
A binding hot spot is a small area at a protein-protein interface that can make significant contribution to binding free energy. This work investigates the substantial contribution made by some special co-occurring atomic contacts at a binding hot spot. A co-occurring atomic contact is a pair of atomic contacts that are close to each other with no more than three covalent-bond steps. We found that two kinds of co-occurring atomic contacts can play an important part in the accurate prediction of binding hot spot residues. One is the co-occurrence of two nearby hydrogen bonds. For example, mutations of any residue in a hydrogen bond network consisting of multiple co-occurring hydrogen bonds could disrupt the interaction considerably. The other kind of co-occurring atomic contact is the co-occurrence of a hydrophobic carbon contact and a contact between a hydrophobic carbon atom and a π ring. In fact, this co-occurrence signifies the collective effect of hydrophobic contacts. We also found that the B-factor measurements of several specific groups of amino acids are useful for the prediction of hot spots. Taking the B-factor, individual atomic contacts and the co-occurring contacts as features, we developed a new prediction method and thoroughly assessed its performance via cross-validation and independent dataset test. The results show that our method achieves higher prediction performance than well-known methods such as Robetta, FoldX and Hotpoint. We conclude that these contact descriptors, in particular the novel co-occurring atomic contacts, can be used to facilitate accurate and interpretable characterization of protein binding hot spots.
Liu, Q, Song, R & Li, J 2015, 'Inference of gene interaction networks using conserved subsequential patterns from multiple time course gene expression datasets', BMC Genomics, vol. 16, no. Suppl 12, pp. 1-16.
MOTIVATION: Deciphering gene interaction networks (GINs) from time-course gene expression (TCGx) data is highly valuable to understand gene behaviors (e.g., activation, inhibition, time-lagged causality) at the system level. Existing methods usually use a global or local proximity measure to infer GINs from a single dataset. As the noise contained in a single data set is hardly self-resolved, the results are sometimes not reliable. Also, these proximity measurements cannot handle the co-existence of the various in vivo positive, negative and time-lagged gene interactions. METHODS AND RESULTS: We propose to infer reliable GINs from multiple TCGx datasets using a novel conserved subsequential pattern of gene expression. A subsequential pattern is a maximal subset of genes sharing positive, negative or time-lagged correlations of one expression template on their own subsets of time points. Based on these patterns, a GIN can be built from each of the datasets. It is assumed that reliable gene interactions would be detected repeatedly. We thus use conserved gene pairs from the individual GINs of the multiple TCGx datasets to construct a reliable GIN for a species. We apply our method on six TCGx datasets related to yeast cell cycle, and validate the reliable GINs using protein interaction networks, biopathways and transcription factor-gene regulations. We also compare the reliable GINs with those GINs reconstructed by a global proximity measure Pearson correlation coefficient method from single datasets. It has been demonstrated that our reliable GINs achieve much better prediction performance especially with much higher precision. The functional enrichment analysis also suggests that gene sets in a reliable GIN are more functionally significant. Our method is especially useful to decipher GINs from multiple TCGx datasets related to less studied organisms where little knowledge is available except gene expression data.
Song, R, Liu, Q, Liu, T & Li, J 2015, 'Connecting rules from paired miRNA and mRNA expression data sets of HCV patients to detect both inverse and positive regulatory relationships', BMC Genomics, vol. 16, no. Suppl 2.
Intensive research based on the inverse expression relationship has been undertaken to discover the miRNA-mRNA regulatory modules involved in the infection of Hepatitis C virus (HCV), the leading cause of chronic liver diseases. However, biological studies in other fields have found that the inverse expression relationship is not the only regulatory relationship between miRNAs and their targets, and some miRNAs can positively regulate an mRNA by binding at the 5’ UTR of the mRNA.
Xie, C, Zhang, J, Li, R, Li, J, Hong, P, Xia, J & Chen, P 2015, 'Automatic classification for field crop insects via multiple-task sparse representation and multiple-kernel learning', Computers and Electronics in Agriculture, vol. 119, pp. 123-132.
Zhao, ZQ, Han, GS, Yu, ZG & Li, J 2015, 'Laplacian Normalization and Random Walk on Heterogeneous Networks for Disease-gene Prioritization', Computational Biology and Chemistry, vol. 57, pp. 21-28.
Random walk on heterogeneous networks is a recently emerging approach to effective disease gene prioritization. Laplacian normalization is a technique capable of normalizing the weight of edges in a network. We use this technique to normalize the gene matrix and the phenotype matrix before the construction of the heterogeneous network, and also use this idea to define the transition matrices of the heterogeneous network. Our method has remarkably better performance than the existing methods for recovering known gene–phenotype relationships. The Shannon information entropy of the distribution of the transition probabilities in our networks is found to be smaller than that of the networks constructed by the existing methods, implying that a higher number of top-ranked genes can be verified as disease genes. In fact, the most probable gene–phenotype relationships ranked within the top 3 or top 5 in our gene lists can be confirmed by the OMIM database in many cases. Our algorithms have shown remarkably superior performance over the state-of-the-art algorithms for recovering gene–phenotype relationships. All Matlab code is available upon email request.
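The two ingredients named above can be sketched on a single small network. This is a simplified illustration, not the heterogeneous gene-phenotype construction of the paper: Laplacian normalization is taken here to mean dividing each edge weight by the square root of the product of its endpoint degrees, and the walk is a standard random walk with restart. The function names, the restart probability of 0.7, and the toy path graph are assumptions.

```python
import math


def laplacian_normalize(weights):
    """Scale each edge weight w_ij by 1 / sqrt(d_i * d_j), with d the weighted degree."""
    n = len(weights)
    degrees = [sum(row) for row in weights]
    return [
        [
            weights[i][j] / math.sqrt(degrees[i] * degrees[j])
            if degrees[i] > 0 and degrees[j] > 0
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def random_walk_with_restart(matrix, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * M p + restart * p0 until p stops changing."""
    n = len(matrix)
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1.0 - restart) * sum(matrix[i][j] * p[j] for j in range(n))
            + restart * p0[i]
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


if __name__ == "__main__":
    # Toy path graph 0 - 1 - 2 - 3 with unit edge weights.
    path = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    scores = random_walk_with_restart(laplacian_normalize(path), seeds={0})
    print(scores)
```

Nodes closer to the seed receive higher scores, which is the property that ranking-based prioritization relies on.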
Song, R, Catchpoole, DR, Kennedy, PJ & Li, J 2015, 'Identification of lung cancer miRNA-miRNA co-regulation networks through a progressive data refining approach', Journal of Theoretical Biology, vol. 380, pp. 271-279.
Fong, S, Deb, S, Yang, X-S & Li, J 2014, 'Feature Selection in Life Science Classification: Metaheuristic Swarm Search', IT Professional, vol. 16, no. 4, pp. 24-29.
Li, C, Chen, P, Wang, R, Wang, X, Su, Y & Li, J 2014, 'PPI-IRO: A Two-Stage Method for Protein-Protein Interaction Extraction Based on Interaction Relation Ontology', International Journal of Data Mining and Bioinformatics, vol. 10, no. 1, pp. 98-119.
Liu, Q, Chen, YP & Li, J 2014, 'k-Partite cliques of protein interactions: A novel subgraph topology for functional coherence analysis on PPI networks', Journal of Theoretical Biology, vol. 340, pp. 146-154.
Liu, Q, Hoi, SC, Kwoh, C, Wong, L & Li, J 2014, 'Integrating water exclusion theory into beta contacts to predict binding free energy changes and binding hot spots', BMC Bioinformatics, vol. 15, no. 57.
Liu, Q, Li, Z & Li, J 2014, 'Use B-factor related features for accurate classification between protein binding interfaces and crystal packing contacts', BMC Bioinformatics, vol. 15, no. Suppl 16, pp. S3-S3.
Zhao, L, Hoi, SC, Li, Z, Wong, L, Nguyen, H & Li, J 2014, 'Coupling Graphs, Efficient Algorithms and B-cell Epitope Prediction', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, no. 1, pp. 7-16.
Coupling graphs are newly introduced in this paper to meet many application needs, particularly in the field of bioinformatics. A coupling graph is a two-layer graph complex, in which each node from one layer of the graph complex has at least one connection with the nodes in the other layer, and vice versa. The coupling graph model is sufficiently powerful to capture strong and inherent associations between subgraph pairs in complicated applications. The focus of this paper is on algorithms for mining frequent coupling subgraphs and on their bioinformatics application. Although existing frequent subgraph mining algorithms are competent to identify frequent subgraphs from a graph database, they perform poorly on frequent coupling subgraph mining because they generate many irrelevant subgraphs. We propose a novel graph transformation technique to transform a coupling graph into a generic graph. Based on the transformed coupling graphs, existing graph mining methods are then utilized to discover frequent coupling subgraphs. We prove that the transformation is precise and complete and that the restoration is reversible. Experiments carried out on a database containing 10,511 coupling graphs show that our proposed algorithm substantially reduces the mining time in comparison with the existing subgraph mining algorithms. Moreover, we demonstrate the usefulness of frequent coupling subgraphs by applying our algorithm to make accurate predictions of epitopes in antibody-antigen binding.
Zhou, Y, Tang, M, Pan, W, Li, J, Wang, W, Shao, J, Wu, L, Li, J, Yang, Q & Yan, B 2014, 'Bird Flu Outbreak Prediction via Satellite Tracking', IEEE Intelligent Systems, vol. 29, no. 4, pp. 10-17.
Advanced satellite tracking technologies have collected huge amounts of wild bird migration data. Biologists use these data to understand dynamic migration patterns, study correlations between habitats, and predict global spreading trends of avian influenza. The research discussed here transforms the biological problem into a machine learning problem by converting wild bird migratory paths into graphs. H5N1 outbreak prediction is achieved by discovering weighted closed cliques from the graphs using the mining algorithm High-wEight cLosed cliquE miNing (HELEN). The learning algorithm HELEN-p then predicts potential H5N1 outbreaks at habitats. This prediction method is more accurate than traditional methods used on a migration dataset obtained through a real satellite bird-tracking system. Empirical analysis shows that H5N1 spreads in a manner of high-weight closed cliques and frequent cliques.
Song, R, Liu, Q, Hutvagner, G, Hung, N, Ramamohanarao, K, Wong, L & Li, J 2014, 'Rule discovery and distance separation to detect reliable miRNA biomarkers for the diagnosis of lung squamous cell carcinoma', BMC Genomics, vol. 15.
Ren, J, Ellis, JT & Li, J 2014, 'Influenza A HA's conserved epitopes and broadly neutralizing antibodies: a prediction method', Journal of Bioinformatics and Computational Biology, vol. 12, no. 5.
Ren, J, Liu, Q, Ellis, J & Li, J 2014, 'Tertiary structure-based prediction of conformational B-cell epitopes through B factors', Bioinformatics, vol. 30, no. 12, pp. 264-273.
Motivation: A B-cell epitope is a small area on the surface of an antigen that binds to an antibody. Accurately locating epitopes is of critical importance for vaccine development. Compared with wet-lab methods, computational methods have strong potential for efficient and large-scale epitope prediction for antigen candidates at much lower cost. However, it is still not clear which features are good determinants for accurate epitope prediction, leading to the unsatisfactory performance of existing prediction methods.
Method and results: We propose a much more accurate B-cell epitope prediction method. Our method uses a new feature B factor (obtained from X-ray crystallography), combined with other basic physicochemical, statistical, evolutionary and structural features of each residue. These basic features are extended by a sequence window and a structure window. All these features are then learned by a two-stage random forest model to identify clusters of antigenic residues and to remove isolated outliers. Tested on a dataset of 55 epitopes from 45 tertiary structures, we prove that our method significantly outperforms all three existing structure-based epitope predictors. Following comprehensive analysis, it is found that features such as B factor, relative accessible surface area and protrusion index play an important role in characterizing B-cell epitopes. Our detailed case studies on an HIV antigen and an influenza antigen confirm that our second stage learning is effective for clustering true antigenic residues and for eliminating self-made prediction errors introduced by the first-stage learning.
Chen, P, Li, J, Wong, L, Kuwahara, H, Huang, J & Gao, X 2013, 'Accurate prediction of hot spot residues through physicochemical characteristics of amino acid sequences', Proteins: Structure, Function, and Bioinformatics, vol. 81, no. 8, pp. 1351-1362.
Hot spot residues of proteins are fundamental interface residues that help proteins perform their functions. Detecting hot spots by experimental methods is costly and time-consuming. Sequential and structural information has been widely used in the compu
Li, Z, He, Y, Liu, Q, Zhao, L, Wong, L, Kwok, CY, Nguyen, HT & Li, J 2013, 'Structural analysis on mutation residues and interfacial water molecules for human TIM disease understanding', BMC Bioinformatics, vol. 14, no. S16, pp. 1-15.
Background: Human triosephosphate isomerase (HsTIM) deficiency is a genetic disease caused often by the pathogenic mutation E104D. This mutation, located at the side of an abnormally large cluster of water in the inter-subunit interface, reduces the thermostability of the enzyme. Why and how these water molecules are directly related to the excessive thermolability of the mutant have not been investigated in structural biology. Results: This work compares the structure of the E104D mutant with its wild-type counterparts. It is found that the water topology in the dimer interface of HsTIM is atypical, having a "wet-core-dry-rim" distribution with 16 water molecules tightly packed in a small deep region surrounded by 22 residues including GLU104. These water molecules are co-conserved with their surrounding residues in non-archaeal TIMs (dimers) but not conserved across archaeal TIMs (tetramers), indicating their importance in preserving the overall quaternary structure. As the structural permutation induced by the mutation is not significant, we hypothesize that the excessive thermolability of the E104D mutant is attributed to the easy propagation of atoms' flexibility from the surface into the core via the large cluster of water. It is indeed found that the B factor increment in the wet region is higher than in other regions, and, more importantly, the B factor increment in the wet region is maintained in the deeply buried core. Molecular dynamics simulations revealed that for the mutant structure at normal temperature, a clear increase of the root-mean-square deviation is observed for the wet region in contact with the large cluster of interfacial water. Such an increase is not observed for other interfacial regions or the whole protein. This clearly suggests that, in the E104D mutant, the large water cluster is responsible for the subunit interface flexibility and overall thermolability, and it ultimately leads to the deficiency of this enzyme.
Liu, Q, Kwok, CY & Li, J 2013, 'Binding Affinity Prediction for Protein-Ligand Complexes Based on β Contacts and B Factor', Journal of Chemical Information and Modeling, vol. 53, no. 11, pp. 3076-3085.
Accurate determination of protein-ligand binding affinity is a fundamental problem in biochemistry useful for many applications including drug design and protein-ligand docking. A number of scoring functions have been proposed for the prediction of protein-ligand binding affinity. However, accurate prediction is still a challenging problem because poor performance is often seen in the evaluation under the leave-one-cluster-out cross-validation (LCOCV). We introduce a new scoring function named B2BScore to improve the prediction performance. B2BScore integrates two physicochemical properties for protein-ligand binding affinity prediction. One is the property of β contacts. A β contact between two atoms requires no other atoms to interrupt the atomic contact and assumes that the two atoms should have enough direct contact area. The other is the property of B factor to capture the atomic mobility in the dynamic protein-ligand binding process.
Chen, P, Wong, L & Li, J 2012, 'Detection Of Outlier Residues For Improving Interface Prediction In Protein Heterocomplexes', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 4, pp. 1155-1165.
Sequence-based understanding and identification of protein binding interfaces is a challenging research topic due to the complexity in protein systems and the imbalanced distribution between interface and noninterface residues. This paper presents an out
Li, Y & Li, J 2012, 'Disease gene identification by random walk on multigraphs merging heterogeneous genomic and phenotype data', BMC Genomics, vol. 13, no. suppl 7, pp. 1-12.
Background: High-throughput experiments have resulted in many genomic datasets and hundreds of candidate disease genes. To discover the real disease genes from a set of candidate genes, computational methods have been proposed and applied to various types of genomic data sources. As a single source of genomic data is prone to bias, incompleteness and noise, integration of different genomic data sources is in high demand for reliable disease gene identification. Results: In contrast to the commonly adopted data integration approach, which integrates separate lists of candidate genes derived from each single data source, we merge various genomic networks into a multigraph which is capable of connecting multiple edges between a pair of nodes. This novel approach provides a data platform with strong noise tolerance to prioritize the disease genes. A new idea of random walk is then developed to work on multigraphs using a modified step to calculate the transition matrix. Our method is further enhanced to deal with heterogeneous data types by allowing cross-walk between phenotype and gene networks. Compared on benchmark datasets, our method is shown to be more accurate than the state-of-the-art methods in disease gene identification. We also conducted a case study to identify disease genes for Insulin-Dependent Diabetes Mellitus. Some of the newly identified disease genes are supported by recently published literature.
Li, Z, He, Y, Cao, L, Wong, L & Li, J 2012, 'Conservation of water molecules in protein binding interfaces', International Journal of Bioinformatics Research and Applications, vol. 8, no. 3/4, pp. 228-244.
The conservation of interfacial water molecules has only been studied in small data sets consisting of interfaces of a specific function. So far, no general conclusions have been drawn from large-scale analysis, due to the challenges of using structural alignment in large data sets. To avoid using structural alignment, we propose a solvated sequence method to analyse water conservation properties in protein interfaces. We first use water information to label the residues, and then align interfacial residues in a fashion similar to normal sequence alignment. Our results show that, for a water-contacting interfacial residue, substituting it with a hydrophobic residue tends to desolvate the local area. Surprisingly, residues with short side chains also tend not to lose their contacting water, emphasising the role of water in shaping binding sites. Deeply buried water molecules are found to be more conserved in terms of their contacts with interfacial residues.
Li, Z, He, Y, Wong, L & Li, J 2012, 'Progressive dry-core-wet-rim hydration trend in a nested-ring topology of protein binding interfaces', BMC Bioinformatics, vol. 13.
Background: Water is an integral part of protein complexes. It shapes protein binding sites by filling cavities and it bridges local contacts by hydrogen bonds. However, water molecules have usually not been included in protein interface models in the past, and few distribution profiles of water molecules in protein binding interfaces are known. Results: In this work, we use a tripartite protein-water-protein interface model and a nested-ring atom re-organization method to detect hydration trends and patterns from an interface data set which involves immobilized interfacial water molecules. This data set consists of 206 obligate interfaces, 160 non-obligate interfaces, and 522 crystal packing contacts. The two types of biological interfaces are found to be drier than the crystal packing interfaces in our data, in agreement with a hydration pattern reported earlier, although the previous definition of immobilized water is purely distance-based. The biological interfaces in our data set are also found to be subject to stronger water exclusion in their formation. To study the overall hydration trend in protein binding interfaces, atoms at the same burial level in each tripartite protein-water-protein interface are organized into a ring. The rings of an interface are then ordered with the core atoms placed at the middle of the structure to form a nested-ring topology. We find that water molecules on the rings of an interface are generally configured in a dry-core-wet-rim pattern with a progressive level-wise solvation towards the rim of the interface. This solvation trend becomes even sharper when counterexamples are separated.
Liu, Q, Wong, L & Li, J 2012, 'Z-score biological significance of binding hot spots of protein interfaces by using crystal packing as the reference state', BBA - Proteins and Proteomics, vol. 1824, no. 12, pp. 1457-1467.
Characterization of binding hot spots of protein interfaces is a fundamental study in molecular biology. Many computational methods have been proposed to identify binding hot spots. However, there are few studies to assess the biological significance of binding hot spots. We introduce the notion of biological significance of a contact residue for capturing the probability of the residue occurring in or contributing to protein binding interfaces. We take a statistical Z-score approach to the assessment of the biological significance. The method has three main steps. First, the potential score of a residue is defined by using a knowledge-based potential function with relative accessible surface area calculations. A null distribution of this potential score is then generated from artifact crystal packing contacts. Finally, the Z-score significance of a contact residue with a specific potential score is determined according to this null distribution. We hypothesize that residues at binding hot spots have big absolute values of Z-score as they contribute greatly to binding free energy. Thus, we propose to use Z-score to predict whether a contact residue is a hot spot residue. Comparison with previously reported methods on two benchmark datasets shows that this Z-score method is mostly superior to earlier methods. This article is part of a Special Issue entitled: Computational Methods for Protein Interaction and Structural Prediction.
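The last two steps described above (a null distribution, then a Z-score against it) can be sketched as follows. The knowledge-based potential itself is not reproduced: the scores below are made-up numbers standing in for it, and the use of the sample standard deviation, the cutoff of 2.0, and the function names are all assumptions made for illustration.

```python
from statistics import mean, stdev


def z_score(score, null_scores):
    """Standardize a potential score against a null distribution of scores.

    The null scores are assumed to come from crystal packing contacts;
    the sample standard deviation is used here (an assumption).
    """
    return (score - mean(null_scores)) / stdev(null_scores)


def predict_hot_spot(score, null_scores, threshold=2.0):
    """Call a residue a hot spot when |Z| reaches a hypothetical cutoff."""
    return abs(z_score(score, null_scores)) >= threshold


if __name__ == "__main__":
    # Placeholder null scores, not data from the paper.
    null = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(z_score(3.0, null))
    print(predict_hot_spot(10.0, null), predict_hot_spot(3.5, null))
```

Taking the absolute value reflects the hypothesis in the abstract that hot spot residues have large absolute Z-scores, regardless of sign.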
A multi-interface domain is a domain that can shape multiple and distinctive binding sites to contact many other domains, forming a hub in domain-domain interaction networks. The functions played by the multiple interfaces are usually different, but there is no strict bijection between the functions and interfaces, as some subsets of the interfaces play the same function. This work applies graph theory and algorithms to discover fingerprints for the multiple interfaces of a domain and to establish associations between the interfaces and functions, based on a huge set of multi-interface proteins from PDB. We found that about 40% of proteins have the multi-interface property; however, the involved multi-interface domains account for only a tiny fraction (1.8%) of the total number of domains. The interfaces of these domains are distinguishable in terms of their fingerprints, indicating the functional specificity of the multiple interfaces in a domain. Furthermore, we observed that both cooperative and distinctive structural patterns, which will be useful for protein engineering, exist in the multiple interfaces of a domain.
Background: Prediction of B-cell epitopes from antigens is useful to understand the immune basis of antibody-antigen recognition, and is helpful in vaccine design and drug development. Tremendous efforts have been devoted to this long-studied problem; however, existing methods have at least two common limitations. One is that they only favor prediction of those epitopes with protrusive conformations, but show poor performance in dealing with planar epitopes. The other limitation is that they predict all of the antigenic residues of an antigen as belonging to one single epitope even when multiple non-overlapping epitopes of an antigen exist. Results: In this paper, we propose to divide an antigen surface graph into subgraphs by using a Markov Clustering algorithm, and then we construct a classifier to distinguish these subgraphs as epitope or non-epitope subgraphs. This classifier is then used to predict epitopes for a test antigen. On a large data set comprising 92 antigen-antibody PDB complexes, our method significantly outperforms the state-of-the-art epitope prediction methods, achieving a 24.7% higher averaged f-score than the best existing models. In particular, our method can successfully identify those epitopes with a non-planarity which is too small to be addressed by the other models. Our method can also detect multiple epitopes whenever they exist.
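The graph-partitioning step can be illustrated with a compact, pure-Python sketch of Markov Clustering (alternating expansion and inflation on a column-stochastic matrix). It is a generic textbook-style version run on a toy graph, not the authors' pipeline: the construction of the antigen surface graph and the subsequent classifier are not shown, and the inflation value of 2.0, the tolerances, and the function names are assumptions.

```python
def normalize_columns(matrix):
    """Rescale every column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster an undirected graph given as a square adjacency matrix.

    Self-loops are added, then expansion (matrix squaring) and inflation
    (element-wise power followed by column normalization) alternate until
    the matrix stops changing. Each non-empty row of the final matrix
    yields one cluster.
    """
    n = len(adjacency)
    m = normalize_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        expanded = [
            [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        change = max(
            abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n)
        )
        m = inflated
        if change < tol:
            break
    clusters = set()
    for row in m:
        members = frozenset(j for j, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return clusters


if __name__ == "__main__":
    # Toy graph: two disjoint triangles (nodes 0-2 and 3-5).
    triangles = [
        [0, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 1],
        [0, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 1, 0],
    ]
    print(markov_cluster(triangles))
```

In an epitope setting, each node would stand for a surface residue and each returned cluster would be one candidate subgraph to classify as epitope or non-epitope.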
Li, Z, Wong, L & Li, J 2011, 'DBAC: A Simple Prediction Method For Protein Binding Hot Spots Based On Burial Levels And Deeply Buried Atomic Contacts', BMC Systems Biology, vol. 5, no. S1, pp. 1-11.
Background: A protein binding hot spot is a cluster of residues in the interface that are energetically important for the binding of the protein with its interaction partner. Identifying protein binding hot spots can give useful information to protein en
Liu, Q, Hoi, S, Su, C, Li, Z, Kwoh, C, Wong, L & Li, J 2011, 'Structural Analysis Of The Hot Spots In The Binding Between H1N1 HA And The 2D1 Antibody: Do Mutations Of H1N1 From 1918 To 2009 Affect Much On This Binding?', Bioinformatics, vol. 27, no. 18, pp. 2529-2536.
Motivation: Worldwide and substantial mortality caused by the 2009 H1N1 influenza A has stimulated a new surge of research on H1N1 viruses. An epitope conservation has been learned in the HA1 protein that allows antibodies to cross-neutralize both 1918 a
Lo, D, Li, J, Wong, L & Khoo, S 2011, 'Mining Iterative Generators And Representative Rules For Software Specification Discovery', IEEE Transactions On Knowledge And Data Engineering, vol. 23, no. 2, pp. 282-296.
Billions of dollars are spent annually on software-related cost. It is estimated that up to 45 percent of software cost is due to the difficulty in understanding existing systems when performing maintenance tasks (i.e., adding features, removing bugs, et
Sim, K, Liu, G, Gopalkrishnan, V & Li, J 2011, 'A Case Study On Financial Ratios Via Cross-graph Quasi-bicliques', Information Sciences, vol. 181, no. 1, pp. 201-216.
Stocks with similar financial ratio values across years have similar price movements. We investigate this hypothesis by clustering groups of stocks that exhibit homogeneous financial ratio values across years, and then study their price movements. We pro
Tang, M, Zhou, Y, Li, J, Wang, W, Cui, P, Hou, Y, Luo, Z, Li, J, Lei, F & Yan, B 2011, 'Exploring The Wild Birds' Migration Data For The Disease Spread Study Of H5N1: A Clustering And Association Approach', Knowledge And Information Systems, vol. 27, no. 2, pp. 227-251.
Knowledge about the wetland use of migratory bird species during the annual life cycle is of great interest to biologists, as it is critically important in many decision-making processes such as conservation site construction and avian influenza control. The raw data on habitat areas and migration routes, determined by GPS satellite telemetry, are usually large in scale and highly complex. In this paper, we convert these biological problems into computational studies and introduce efficient algorithms for the data analysis. Our key ideas are hierarchical clustering for localizing migration habitats, and association rules for discovering migration routes from the scattered location points in the GIS. One of our clustering results is a tree structure, called a spatial-tree, which is an illustrative map depicting the breeding and wintering home range of bar-headed geese. A related result is an association pattern indicating that the autumn migration routes of bar-headed geese most likely run between the breeding sites at Qinghai Lake, China and the wintering sites in the Tibet river valley. Given the potential of geese to spread H5N1, and on the basis of the chronology and rates of the bar-headed geese migration movements, we conjecture that bar-headed geese play an important role in the spread of the H5N1 virus at a regional scale on the Qinghai-Tibetan Plateau.
Zeng, T, Li, J & Liu, J 2011, 'Distinct Interfacial Biclique Patterns Between ssDNA-binding Proteins And Those With dsDNAs', Proteins-Structure Function And Bioinformatics, vol. 79, no. 2, pp. 598-610.
We introduce a new motif called interfacial biclique pattern to study the difference between double-stranded DNA-binding proteins (DSBs, most of them also known to play the role as transcriptional factors) and single-stranded DNA-binding proteins (SSBs)
Zhao, L, Wong, L & Li, J 2011, 'Antibody-specified B-cell Epitope Prediction In Line With The Principle Of Context-awareness', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 6, pp. 1483-1494.
Context-awareness is a characteristic in the recognition between antigens and antibodies, highlighting the reconfiguration of epitope residues when an antigen interacts with a different antibody. A coarse binary classification of antigen regions into epi
Anandagopu, P, Rashid, S & Li, J 2010, 'Low Thymine Content in PINK1 mRNAs and Insights into Parkinson's disease', Bioinformation, vol. 4, no. 10, pp. 452-455.
Thymine is the only nucleotide base which is changed to uracil upon transcription, leaving mRNA less hydrophobic compared to its DNA counterpart. All 16 codons that contain uracil (or thymine in the gene) as the second nucleotide code for the five large hydrophobic residues (LHRs), namely phenylalanine, isoleucine, leucine, methionine and valine. Thymine content (i.e. the fraction of XTX codons, where X = A, C, G, or T) in PINK1 mRNA sequences and its relationship with protein stability and function are the focus of this work. This analysis sheds light on PINK1's stability, and thus may provide a clue to understanding the mitochondrial dysfunction and the failure of oxidative stress control frequently observed in Parkinson's disease. We obtained the complete PINK1 mRNA sequences of 8 different species and calculated the distributions of XTX codons in the different frames. We observed that the thymine content reached its highest level in coding frame 1 of the PINK1 mRNA sequence of Bos taurus (Bt), peaking at 27%. Low thymine content in coding frame 1 leads to a reduction in LHRs in the corresponding proteins. Therefore, we conjecture that proteins from the other organisms, including Homo sapiens, lost some of their hydrophobicity and became susceptible to dysfunction. Genes such as PINK1 have reduced thymine in the evolutionary process, thereby making their protein products potentially susceptible to instability and disease. Adding more hydrophobic residues (thymine) at appropriate places might help conserve important biological functions.
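The thymine content defined above (the fraction of XTX codons in a given reading frame) is simple to compute. The sketch below is only an illustration: the function name `xtx_fraction`, the 0-based `frame` offset (0, 1 or 2, which need not match the paper's frame numbering) and the toy sequence are assumptions, and the sequence is not a PINK1 sequence.

```python
def xtx_fraction(sequence, frame=0):
    """Fraction of codons whose second base is T (or U) in one reading frame.

    `frame` is a 0-based offset into the sequence (assumed convention).
    """
    seq = sequence.upper().replace("U", "T")
    # Split into complete codons only; a trailing partial codon is ignored.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if not codons:
        return 0.0
    return sum(1 for codon in codons if codon[1] == "T") / len(codons)


toy_sequence = "ATGCTTGGA"  # made-up example
for offset in range(3):
    print(offset, xtx_fraction(toy_sequence, offset))
```

Comparing the three offsets side by side is what allows a statement such as "the thymine content is highest in one coding frame" to be checked for a given sequence.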
Chen, P & Li, J 2010, 'Prediction of protein long-range contacts using an ensemble of genetic algorithm classifiers with sequence profile centers', BMC Structural Biology, vol. 10, no. Suppl. 1, pp. 1-13.
Background: Prediction of long-range inter-residue contacts is an important topic in bioinformatics research. It is helpful for determining protein structures, understanding protein foldings, and therefore advancing the annotation of protein functions. R
Chen, P & Li, J 2010, 'Sequence-based Identification Of Interface Residues By An Integrative Profile Combining Hydrophobic And Evolutionary Information', BMC Bioinformatics, vol. 11.
Background: Protein-protein interactions play essential roles in protein function determination and drug design. Numerous methods have been proposed to recognize their interaction sites, however, only a small proportion of protein complexes have been suc
Chen, P, Liu, C, Burge, L, Li, J, Mohammad, M, Southerland, W, Gloster, C & Wang, B 2010, 'DomSVR: Domain Boundary Prediction With Support Vector Regression From Sequence Information Alone', Amino Acids, vol. 39, no. 3, pp. 713-726.
Protein domains are structural and fundamental functional units of proteins. The information of protein domain boundaries is helpful in understanding the evolution, structures and functions of proteins, and also plays an important role in protein classif
Feng, M, Dong, G, Li, J, Tan, Y & Wong, L 2010, 'Pattern Space Maintenance For Data Updates And Interactive Mining', Computational Intelligence, vol. 26, no. 3, pp. 282-317.
This article addresses the incremental and decremental maintenance of the frequent pattern space. We conduct an in-depth investigation on how the frequent pattern space evolves under both incremental and decremental updates. Based on the evolution analys
Li, Z & Li, J 2010, 'Geometrically centered region: A "wet" model of protein binding hot spots not excluding water molecules', Proteins-Structure Function And Bioinformatics, vol. 78, no. 16, pp. 3304-3316.
A protein interface can be as 'wet' as a protein surface in terms of the number of immobilized water molecules. This important water information has not been explicitly taken by computational methods to model and identify protein binding hot spots, overl
Liu, Q & Li, J 2010, 'Propensity Vectors Of Low-ASA Residue Pairs In The Distinction Of Protein Interactions', Proteins-Structure Function And Bioinformatics, vol. 78, no. 3, pp. 589-602.
We introduce low-ASA residue pairs as classification features for distinguishing the different types of protein interactions. A low-ASA residue pair is defined as two contact residues each from one chain that have a small solvent accessible surface area
Liu, X, Li, J & Wang, L 2010, 'Modeling Protein Interacting Groups By Quasi-bicliques: Complexity, Algorithm, And Application', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 354-364.
Protein-protein interactions (PPIs) are one of the most important mechanisms in cellular processes. To model protein interaction sites, recent studies have suggested to find interacting protein group pairs from large PPI networks at the first step and th
Luo, F, Liu, J & Li, J 2010, 'Discovering conditional co-regulated protein complexes by integrating diverse data sources', BMC Systems Biology, vol. 4, no. Suppl. 2, pp. 1-13.
Background: Proteins interacting with each other as a complex play an important role in many molecular processes and functions. Directly detecting protein complexes is still costly, whereas many protein-protein interaction (PPI) maps for model organisms are available owing to the fast development of high-throughput PPI detection techniques. These binary PPI data provide fundamental and abundant information for inferring new protein complexes. However, PPI data from different experiments usually do not overlap very much. The main reason is that the functions of proteins are activated only in certain environments or under certain stimuli. In short, PPIs are condition-specific. Therefore, specifying the conditions under which complexes are present is necessary for a deep understanding of their behaviours. Meanwhile, proteins have various ways of interacting and various control mechanisms to form different kinds of complexes. Thus the discovery of a certain type of complex should depend on its own distinct biological or topological characteristics. We do not attempt to find all kinds of complexes by using a single set of features. Here, we integrate transcription regulation data (TR), gene expression data (GE) and protein-protein interaction data at the systems biology level to discover a special kind of protein complex called conditional co-regulated protein complexes. A conditional co-regulated protein complex has three remarkable features: the coding genes of the member proteins share the same transcription factor (TF), the coding genes are expressed co-ordinately under a certain condition, and the member proteins interact mutually as a complex to implement a common biological function.
Mann, S, Li, J & Chen, Y 2010, 'Insights Into Bacterial Genome Composition Through Variable Target GC Content Profiling', Journal of Computational Biology, vol. 17, no. 1, pp. 79-96.
This study presents a new computational method for guanine (G) and cytosine (C), or GC, content profiling based on the idea of multiple resolution sampling (MRS). The benefit of our new approach over existing techniques follows from its ability to locate
Zeng, T & Li, J 2010, 'Maximization of negative correlations in time-course gene expression data for enhancing understanding of molecular pathways', Nucleic Acids Research, vol. 38, no. 1.
Zhao, L & Li, J 2010, 'Mining For The Antibody-antigen Interacting Associations That Predict The B Cell Epitopes', BMC Structural Biology, vol. 10, no. Suppl. 1, pp. 1-13.
Background: Predicting B-cell epitopes is very important for designing vaccines and drugs to fight against the infectious agents. However, due to the high complexity of this problem, previous prediction methods that focus on linear and conformational epi
Li, J & Liu, Q 2009, ''Double water exclusion': A hypothesis refining the O-ring theory for the hot spots at protein interfaces', Bioinformatics, vol. 25, no. 6, pp. 743-750.
Motivation: The O-ring theory reveals that the binding hot spot at a protein interface is surrounded by a ring of residues that are energetically less important than the residues in the hot spot. As this ring of residues is served to occlude water molecu
Traditional similarity measurements often become meaningless when the dimensionality of datasets increases. Subspace clustering has been proposed to find clusters embedded in subspaces of high-dimensional datasets. Many existing algorithms use a grid-based approach to partition the data space into nonoverlapping rectangular cells, and then identify connected dense cells as clusters. The rigid boundaries of the grid-based approach may cause a real cluster to be divided into several small clusters. In this paper, we propose to use a sliding-window approach to partition the dimensions to preserve significant clusters. We call this the nCluster model. The sliding-window approach generates more bins than the grid-based approach, and thus incurs a higher mining cost. We develop a deterministic algorithm, called MaxnCluster, to mine nClusters efficiently. MaxnCluster uses several techniques to speed up the mining, and it produces only maximal nClusters to reduce result size. Non-maximal nClusters are pruned without the need to store the discovered nClusters in memory, which is key to the efficiency of MaxnCluster. Our experimental results show that (i) the nCluster model can indeed preserve clusters that are shattered by the grid-based approach on synthetic datasets; (ii) the nCluster model produces more significant clusters than the grid-based approach on two real gene expression datasets; and (iii) MaxnCluster is efficient in mining maximal nClusters.
Sim, K, Li, J, Gopalkrishnan, V & Liu, G 2009, 'Mining maximal quasi-bicliques: Novel algorithm and applications in the stock market and protein networks', Statistical Analysis and Data Mining, vol. 2, no. 4, pp. 255-273.
Several real-world applications require mining of bicliques, as they represent correlated pairs of data clusters. However, the mining quality is adversely affected by missing and noisy data. Moreover, some applications only require strong interactions between data members of the pairs, whereas bicliques are pairs that display complete interactions. We address these two limitations by proposing maximal quasi-bicliques. Maximal quasi-bicliques tolerate erroneous and missing data, and also relax the interactions between the data members of their pairs. In addition, maximal quasi-bicliques do not suffer from the skewed distribution of missing edges that prior quasi-bicliques have. We develop an algorithm, MQBminer, which mines the complete set of maximal quasi-bicliques from either bipartite or non-bipartite graphs. We demonstrate the versatility and effectiveness of maximal quasi-bicliques in discovering highly correlated pairs of data in two diverse real-world datasets. First, we propose to solve a novel financial stock analysis problem using maximal quasi-bicliques to co-cluster stocks and financial ratios. Results show that the stocks in our co-clusters usually have significant correlations in their price performance. Second, we apply maximal quasi-bicliques to a protein network mining problem and show that pairs of protein groups mined by maximal quasi-bicliques are more significant than those mined by maximal bicliques.
Zeng, X, Pei, J, Wang, K & Li, J 2009, 'PADS: A simple yet effective pattern-aware dynamic search method for fast maximal frequent pattern mining', Knowledge And Information Systems, vol. 20, no. 3, pp. 375-391.
While frequent pattern mining is fundamental for many data mining tasks, mining maximal frequent patterns efficiently is important in both theory and applications of frequent pattern mining. The fundamental challenge is how to search a large space of ite
Kim, S, Kang, J, Chung, Y, Li, J & Ryu, K 2008, 'Clustering Orthologous Proteins Across Phylogenetically Distant Species', Proteins-Structure Function And Bioinformatics, vol. 71, no. 3, pp. 1113-1122.
The quality of orthologous protein clusters (OPCs) is largely dependent on the results of the reciprocal BLAST (basic local alignment search tool) hits among genomes. The BLAST algorithm is very efficient and fast, but it is very difficult to get optimal
Liu, G, Li, J & Wong, L 2008, 'A New Concise Representation Of Frequent Itemsets Using Generators And A Positive Border', Knowledge And Information Systems, vol. 17, no. 1, pp. 35-56.
A complete set of frequent itemsets can get undesirably large due to redundancy when the minimum support threshold is low or when the database is dense. Several concise representations have been previously proposed to eliminate the redundancy. Generator
Liu, G, Li, J & Wong, L 2008, 'Assessing and Predicting Protein Interactions Using Both Local and Global Network Topological Metrics', Genome Informatics 2008, vol. 21, pp. 138-+.
Aung, Z & Li, J 2007, 'Mining super-secondary structure motifs from 3D protein structures: A sequence order independent approach', Genome Informatics 2007, vol. 19, pp. 15-+.
Li, J & Yang, Q 2007, 'Strong Compound-risk Factors: Efficient Discovery Through Emerging Patterns And Contrast Sets', IEEE Transactions on Information Technology in Biomedicine, vol. 11, no. 5, pp. 544-552.
Odds ratio (OR), relative risk (RR) (risk ratio), and absolute risk reduction (ARR) (risk difference) are biostatistics measurements that are widely used for identifying significant risk factors in dichotomous groups of subjects. In the past, they have o
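For readers unfamiliar with the three measures named above, the following sketch computes them from a 2x2 contingency table using their standard textbook definitions. The function name `risk_measures`, the argument names and the counts are assumptions for illustration; nothing here reproduces the paper's pattern-mining method. The risk difference is reported as exposed risk minus unexposed risk; absolute risk reduction is the same quantity with the sign convention chosen so that a protective factor gives a positive value.

```python
def risk_measures(exposed_cases, exposed_noncases,
                  unexposed_cases, unexposed_noncases):
    """Return odds ratio, relative risk and risk difference for a 2x2 table."""
    risk_exposed = exposed_cases / (exposed_cases + exposed_noncases)
    risk_unexposed = unexposed_cases / (unexposed_cases + unexposed_noncases)
    return {
        # Odds of being a case among the exposed vs. among the unexposed.
        "odds_ratio": (exposed_cases * unexposed_noncases)
                      / (exposed_noncases * unexposed_cases),
        # Ratio of the two risks (risk ratio).
        "relative_risk": risk_exposed / risk_unexposed,
        # Difference of the two risks.
        "risk_difference": risk_exposed - risk_unexposed,
    }


# Made-up counts: 20 of 100 exposed subjects and 10 of 100 unexposed
# subjects develop the outcome.
print(risk_measures(20, 80, 10, 90))
```

The sketch does not guard against zero cells; a real analysis would need to handle those, for example with a continuity correction.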
Li, J, Liu, G, Li, H & Wong, L 2007, 'Maximal Biclique Subgraphs And Closed Pattern Pairs Of The Adjacency Matrix: A One-to-one Correspondence And Mining Algorithms', IEEE Transactions On Knowledge And Data Engineering, vol. 19, no. 12, pp. 1625-1636.
Maximal biclique (also known as complete bipartite) subgraphs can model many applications in Web mining, business, and bioinformatics. Enumerating maximal biclique subgraphs from a graph is a computationally challenging problem, as the size of the output
Mann, S, Li, J & Chen, Y 2007, 'A pHMM-ANN based discriminative approach to promoter identification in prokaryote genomic contexts', Nucleic Acids Research, vol. 35, no. 2, pp. 1-7.
The computational approach for identifying promoters on increasingly large genomic sequences has led to many false positives. The biological significance of promoter identification lies in the ability to locate true promoters with and without prior seque
Pang, B, Kuralmani, V, Joshi, R, Yin, H, Lee, K, Ang, B, Li, J, Leong, T & Ng, I 2007, 'Hybrid Outcome Prediction Model For Severe Traumatic Brain Injury', Journal Of Neurotrauma, vol. 24, no. 1, pp. 136-146.
Numerous studies addressing different methods of head injury prognostication have been published. Unfortunately, these studies often incorporate different head injury prognostication models and study populations, thus making direct comparison difficult,
Li, H, Li, J & Wong, L 2006, 'Discovering Motif Pairs At Interaction Sites From Protein Sequences On A Proteome-wide Scale', Bioinformatics, vol. 22, no. 8, pp. 989-996.
Motivation: Protein-protein interaction, mediated by protein interaction sites, is intrinsic to many functional processes in the cell. In this paper, we propose a novel method to discover patterns in protein interaction sites. We observed from protein in
The mining of changes or differences or other comparative patterns from a pair of datasets is an interesting problem. This paper is focused on the mining of one type of comparative pattern called emerging patterns. Emerging patterns are denoted by EPs an
Huang, D, Chow, T, Ma, E & Li, J 2005, 'Efficient Selection Of Discriminative Genes From Microarray Gene Expression Data For Cancer Diagnosis', IEEE Transactions On Circuits and Systems I: Fundamental Theory and Applications, vol. 52, no. 9, pp. 1909-1918.
A new mutual information (MI)-based feature-selection method to solve the so-called large p and small n problem experienced in a microarray gene expression-based data is presented. First, a grid-based feature clustering algorithm is introduced to elimina
Li, H & Li, J 2005, 'Discovery Of Stable And Significant Binding Motif Pairs From PDB Complexes And Protein Interaction Datasets', Bioinformatics, vol. 21, no. 3, pp. 314-324.
Motivation: Discovery of binding sites is important in the study of protein-protein interactions. In this paper, we introduce stable and significant motif pairs to model protein-binding sites. The stability is the pattern's resistance to some transformat
Li, J & Li, H 2005, 'Using Fixed Point Theorems To Model The Binding In Protein-protein Interactions', IEEE Transactions On Knowledge And Data Engineering, vol. 17, no. 8, pp. 1079-1087.
The binding in protein-protein interactions exhibits a kind of biochemical stability in cells. The mathematical notion of fixed points also describes stability. A point is a fixed point if it remains unchanged after a transformation by a function. Many p
Li, J & Wong, L 2005, 'Structural Geography Of The Space Of Emerging Patterns', Intelligent Data Analysis, vol. 9, no. 6, pp. 567-588.
Describing and capturing significant differences between two classes of data is an important data mining and classification research topic. In this paper, we use emerging patterns to describe these significant differences. Such a pattern occurs in one cl
Li, JY, Wong, LS & Yang, Q 2005, 'Data mining in bioinformatics', IEEE Intelligent Systems, vol. 20, no. 6, pp. 16-18.
Liu, H, Han, H, Li, J & Wong, L 2005, 'DNAFSMiner: A Web-based Software Toolbox To Recognize Two Types Of Functional Sites In DNA Sequences', Bioinformatics, vol. 21, no. 5, pp. 671-673.
DNAFSMiner (DNA Functional Sites Miner) is a web-based software toolbox to recognize functional sites in nucleic acid sequences. Currently in this toolbox, we provide two software tools: TIS Miner and Poly(A) Signal Miner. The TIS Miner can be used to predict
Motivation: Patient outcome prediction using microarray technologies is an important application in bioinformatics. Based on patients' genotypic microarray data, predictions are made to estimate patients' survival time and their risk of tumor metastasis
Li, H, Li, J, Tan, SH & Ng, SK 2004, 'Discovery of binding motif pairs from protein complex structural data and protein interaction sequence data', Pacific Symposium on Biocomputing, pp. 312-323.
Unravelling the underlying mechanisms of protein interactions requires knowledge about the interactions' binding sites. In this paper, we use a novel concept, binding motif pairs, to describe binding sites. A binding motif pair consists of two motifs each derived from one side of the binding protein sequences. The discovery is a directed approach that uses a combination of two data sources: 3-D structures of protein complexes and sequences of interacting proteins. We first extract maximal contact segment pairs from the protein complexes' structural data. We then use these segment pairs as templates to sub-group the interacting protein sequence dataset, and conduct an iterative refinement to derive significant binding motif pairs. This combination approach is efficient in handling large datasets of protein interactions. From a dataset of 78,390 protein interactions, we have discovered 896 significant binding motif pairs. The discovered motif pairs include many novel motif pairs as well as motifs that agree well with experimentally validated patterns in the literature.
Li, J & Ong, H 2004, 'Feature Space Transformation For Better Understanding Biological And Medical Classifications', Journal Of Research And Practice In Information Technology, vol. 36, no. 3, pp. 131-144.
Recently published gene expression profiles and proteomic mass/charge ratios are extremely high-dimensional data. Though support vector machines can well learn the inner relationship of the data for classification, the non-linear kernel functions pose an
Li, J, Dong, G, Ramamohanarao, K & Wong, L 2004, 'Deeps: A New Instance-based Lazy Discovery And Classification System', Machine Learning, vol. 54, no. 2, pp. 99-124.
Distance is widely used in most lazy classification systems. Rather than using distance, we make use of the frequency of an instance's subsets of features and the frequency-change rate of the subsets among training classes to perform both knowledge disco
Li, J, Manoukian, T, Dong, G & Ramamohanarao, K 2004, 'Incremental Maintenance On The Border Of The Space Of Emerging Patterns', Data Mining And Knowledge Discovery, vol. 9, no. 1, pp. 89-116.
Emerging patterns (EPs) are useful knowledge patterns with many applications. In recent studies on bio-medical profiling data, we have successfully used such patterns to solve difficult cancer diagnosis problems and produced higher classification accurac
Liu, H, Han, H, Li, J & Wong, L 2004, 'Using amino acid patterns to accurately predict translation initiation sites', In Silico Biology, vol. 4, no. 3, pp. 255-269.
The translation initiation site (TIS) prediction problem is about how to correctly identify TISs in mRNA, cDNA, or other types of genomic sequences. High prediction accuracy helps in understanding protein coding from nucleotide sequences, an important step in genomic analysis. In this paper, we present an in silico method to predict translation initiation sites in vertebrate cDNA or mRNA sequences. This method consists of three sequential steps. In the first step, candidate features are generated using k-gram amino acid patterns. In the second step, a small number of top-ranked features are selected by an entropy-based algorithm. In the third step, a classification model is built to recognize true TISs by applying support vector machines or ensembles of decision trees to the selected features. We have tested our method on several independent data sets, including two public ones and our own extracted sequences. The experimental results achieved are better than those reported previously using the same data sets. Our high accuracy not only demonstrates the feasibility of our method, but also indicates that there might be "amino acid" patterns around TIS in cDNA and mRNA sequences.
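The first of the three steps above, generating k-gram amino acid features, can be pictured with a short sketch. It assumes the nucleotide window around a candidate site has already been translated into an amino acid string; the function name `kgram_counts` and the example string are hypothetical, and the paper's exact windowing and encoding are not given in the abstract.

```python
from collections import Counter


def kgram_counts(amino_acids, k=2):
    """Count every overlapping k-gram in an amino acid string."""
    return Counter(amino_acids[i:i + k]
                   for i in range(len(amino_acids) - k + 1))


# Made-up translated fragment; each distinct k-gram becomes one candidate feature.
features = kgram_counts("MKMKA", k=2)
print(dict(features))
```

With 20 amino acids there are at most 400 possible 2-grams, which is why a later selection step that keeps only a few top-ranked features is useful.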
Liu, HQ, Li, JY & Wong, LS 2004, 'Selection of patient samples and genes for outcome prediction', 2004 IEEE Computational Systems Bioinformatics Conference, Proceedings, pp. 382-392.
Meng, S, Zhang, Z & Li, J 2004, 'Twelve C2H2 Zinc-finger Genes On Human Chromosome 19 Can Be Each Translated Into The Same Type Of Protein After Frameshifts', Bioinformatics, vol. 20, no. 1, pp. 1-4.
We report a discovery that, of the 226 C2H2 zinc-finger (C2H2-ZNF) genes on human chromosome 19, 12 genes each have two open reading frames (ORFs) that are in different reading frames but that can be translated into the same type of C2H2-ZNF proteins. We
Li, J, Liu, H, Downing, J, Yeoh, A & Wong, L 2003, 'Simple Rules Underlying Gene Expression Profiles Of More Than Six Subtypes Of Acute Lymphoblastic Leukemia (ALL) Patients', Bioinformatics, vol. 19, no. 1, pp. 71-78.
Motivations and Results: For classifying gene expression profiles or other types of medical data, simple rules are preferable to non-linear distance or kernel functions. This is because rules may help us understand more about the application in addition
Li, J, Liu, H, Ng, S & Wong, L 2003, 'Discovery Of Significant Rules For Classifying Cancer Diagnosis Data', Bioinformatics, vol. 19.
Methods and Results: We introduce a new method to discover many diversified and significant rules from high dimensional profiling data. We also propose to aggregate the discriminating power of these rules for reliable predictions. The discovered rules ar
Li, JY, Ng, SK & Wong, LS 2003, 'Bioinformatics adventures in database research', Database Theory ICDT 2003, Proceedings, vol. 2572, pp. 31-46.
Liu, H, Han, H, Li, J & Wong, L 2003, 'An in-silico method for prediction of polyadenylation signals in human sequences.', Genome informatics. International Conference on Genome Informatics, vol. 14, pp. 84-93.
This paper presents a machine learning method to predict polyadenylation signals (PASes) in human DNA and mRNA sequences by analysing features around them. This method consists of three sequential steps of feature manipulation: generation, selection and integration of features. In the first step, new features are generated using k-gram nucleotide or amino acid patterns. In the second step, a number of important features are selected by an entropy-based algorithm. In the third step, support vector machines are employed to recognize true PASes from a large number of candidates. Our study shows that true PASes in DNA and mRNA sequences can be characterized by different features, and also shows that both upstream and downstream sequence elements are important for recognizing PASes from DNA sequences. We tested our method on several public data sets as well as our own extracted data sets. In most cases, we achieved better validation results than those reported previously on the same data sets. The important motifs observed are highly consistent with those reported in the literature.
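The abstract does not say which entropy-based algorithm performs the selection step. As a stand-in, the sketch below scores a binary feature by its information gain, one common entropy-based criterion; the function names and the tiny label and feature vectors are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


labels = [1, 1, 0, 0]             # 1 = true site, 0 = false candidate (made up)
informative = [1, 1, 0, 0]        # feature present exactly in the true sites
uninformative = [1, 0, 1, 0]      # feature unrelated to the label
print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

Ranking all candidate k-gram features by such a score and keeping the highest-scoring ones gives a small feature set to pass to the classifier.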
Li, J & Wong, L 2002, 'Identifying Good Diagnostic Gene Groups From Gene Expression Profiles Using The Concept Of Emerging Patterns', Bioinformatics, vol. 18, no. 5, pp. 725-734.
Motivations and Results: Gene groups that are significantly related to a disease can be detected by conducting a series of gene expression experiments. This work is aimed at discovering special types of gene groups that satisfy the following property. In
Liu, H, Li, J & Wong, L 2002, 'A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns.', Genome informatics. International Conference on Genome Informatics, vol. 13, pp. 51-60.
Feature selection plays an important role in classification. We present a comparative study on six feature selection heuristics by applying them to two sets of data. The first set of data are gene expression profiles from Acute Lymphoblastic Leukemia (ALL) patients. The second set of data are proteomic patterns from ovarian cancer patients. Based on features chosen by these methods, error rates of several classification algorithms were obtained for analysis. Our results demonstrate the importance of feature selection in accurately classifying new samples.
Yeoh, E, Ross, M, Shurtleff, S, Williams, W, Patel, D, Mahfouz, R, Behm, F, Raimondi, S, Relling, M, Patel, A, Cheng, C, Campana, D, Wilkins, D, Zhou, X, Li, J, Liu, H, Pui, C, Evans, W, Naeve, C, Wong, L & Downing, J 2002, 'Classification, Subtype Discovery, And Prediction Of Outcome In Pediatric Acute Lymphoblastic Leukemia By Gene Expression Profiling', Cancer Cell, vol. 1, no. 2, pp. 133-143.
Treatment of pediatric acute lymphoblastic leukemia (ALL) is based on the concept of tailoring the intensity of therapy to a patient's risk of relapse. To determine whether gene expression profiling could enhance risk assignment, we used oligonucleotide
Li, J, Dong, G & Ramamohanarao, K 2000, 'Making Use Of The Most Expressive Jumping Emerging Patterns For Classification', Knowledge Discovery And Data Mining, Proceedings: Current Issues And New Applications, vol. 1805, pp. 220-232.
Classification aims to discover a model from training data that can be used to predict the class of test instances. In this paper, we propose the use of jumping emerging patterns (JEPs) as the basis for a new classifier called the JEP-Classifier. Each J
Chow, TWS & Li, JY 1997, 'Higher-order Petri net models based on artificial neural networks', Artificial Intelligence, vol. 92, no. 1-2, pp. 289-300.
In this paper, the properties of higher-order neural networks are exploited in a new class of Petri nets, called higher-order Petri nets (HOPN). Using the similarities between neural networks and Petri nets, this paper demonstrates how the McCulloch-Pitts models and the higher-order neural networks can be represented by Petri nets. A 5-tuple HOPN is defined, and a theorem on the relationship between the potential firability of the goal transition and the T-invariant (HOPN) is proved and discussed. The proposed HOPN can be applied to the polynomial clause subset of first-order predicate logic. A five-clause polynomial logic program example is also included to illustrate the theoretical results. © 1997 Elsevier Science B.V.
Li, J & Chow, T 1997, 'Stochastic Choice Of Basis Functions In Adaptive Function Approximation And The Functional-link Net - Comments', IEEE Transactions on Neural Networks, vol. 8, no. 2, pp. 452-454.
This paper includes some comments on and amendments to the above-mentioned paper. Subsequently, Theorem 1 in the above-mentioned paper has been revised. The significant change to the original theorem is the space of the thresholds in the hidden layer. The re
Li, J & Chow, T 1996, 'Function approximation of higher-order neural networks', Journal of Intelligent Systems, vol. 6, no. 3-4, pp. 239-260.
Li, J & Yu, Y 1996, 'Studies on the approximation capability of the second-order neural network', Kongzhi Lilun Yu Yingyong/Control Theory and Applications, vol. 13, no. 4.
The approximation capability of the second-order neural network is investigated in this paper and the following results have been obtained: 1) it is proved that the second-order neural network can approximate any continuous function to any degree of accuracy; 2) the BP algorithm for the second-order neural network and simulation results are given. The simulated experiments show that when the second-order network has the same number of hidden neurons as the first-order one, its error decreases faster; when the accuracies of the two networks are equal, the second-order network needs far fewer hidden neurons than the first-order one.
Li, J & Fong, S 2016, 'Solving imbalanced dataset problems for high-dimensional image processing by swarm optimization' in Bio-Inspired Computation and Applications in Image Processing, pp. 311-321.
© 2016 Elsevier Ltd All rights reserved. In this chapter, techniques used to optimize the imbalanced class and high-dimensional image datasets using swarm intelligence (SI) algorithms are proposed. Datasets converted from images or multimedia usually have problems of imbalanced class distribution and high-dimensional features. These problems seriously affect the accuracy and efficiency of image processing, especially in machine learning. Compared with other methods, SI optimization algorithms can simultaneously and stochastically solve these two problems in a search space. The SI optimization algorithm is a relatively new approach in the field of artificial intelligence. Specifically, the classical particle swarm optimization and contemporary bat-inspired algorithm are adopted in our experiment. Our proposed method achieved high reliability and high accuracy in classification performance in a computer simulation experiment. Moreover, it can synthesize more reasonable minority class samples as well as select the appropriate features when compared to the existing methods.
Fong, S, Li, J, Gong, X & Vasilakos, AV 2015, 'Advances of applying metaheuristics to data mining techniques' in Improving Knowledge Discovery through the Integration of Data Mining Techniques, pp. 75-103.
© 2015, IGI Global. All rights reserved. Metaheuristics have lately gained popularity among researchers. Their underlying designs are inspired by biological entities and their behaviors, e.g. schools of fish, colonies of insects, and other land animals. They have been used successfully in optimization applications ranging from financial modeling, image processing, resource allocation and job scheduling to bioinformatics. In particular, metaheuristics have proven effective in many combinatorial optimization problems, so that it is not necessary to attempt all possible candidate solutions to a problem via exhaustive enumeration and evaluation, which is computationally intractable. The aim of this paper is to highlight some recent research related to metaheuristics and to discuss how they can enhance the efficacy of data mining algorithms. A foremost challenge in data mining is combinatorial optimization, which often leads to performance degradation and scalability issues. Two case studies are presented, where metaheuristics improve the accuracy of classification and clustering by avoiding local optima.
Li, J & Liu, Q 2013, 'Protein Binding Interfaces and Their Binding Hot Spot Prediction: A Survey' in Shen, B (ed), Bioinformatics for Diagnosis, Prognosis and Treatment of Complex Diseases, Springer, Germany, pp. 79-106.
In living organisms, genes are the blueprints or library, specifying instructions for building proteins. Proteins constitute the bulk of cells. Proteins' mutual binding and interactions play a vital role in numerous functions and activities, such as signal transduction, enzymatic reactions, immunoreactions and inter-cellular communications. This survey provides basic knowledge of proteins and protein binding. First, we describe proteins' fundamental elements, structures and functions. In Sect. 5.2, we present concepts related to protein binding and interactions. In Sect. 5.3, we explain why protein binding interfaces have an uneven distribution of binding free energy. In Sects. 5.4 and 5.5, we explain why protein interfaces are complicated and how current studies deal with this difficult problem. In Sect. 5.6, we present an overview of methods to model and predict the binding free energy of protein interactions. Section 5.7 concludes this survey with a summary.
Li, J & Wong, L 2013, 'Emerging Pattern-Based Rules Characterizing Subtypes of Leukemia' in Dong, G & Bailey, J (eds), Contrast Data Mining: Concepts, Algorithms, and Applications, CRC Press, USA, pp. 219-232.
Dong, G, Li, J, Liu, G & Wong, L 2009, 'Mining conditional contrast patterns' in Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, pp. 294-310.
This chapter considers the problem of "conditional contrast pattern mining." It is related to contrast mining, where one considers the mining of patterns/models that contrast two or more datasets, classes, conditions, time periods, and so forth. Roughly speaking, conditional contrasts capture situations where a small change in patterns is associated with a big change in the matching data of the patterns. More precisely, a conditional contrast is a triple (B, F1, F2) of three patterns; B is the condition/context pattern of the conditional contrast, and F1 and F2 are the contrasting factors of the conditional contrast. Such a conditional contrast is of interest if the difference between F1 and F2 as itemsets is relatively small, and the difference between the corresponding matching dataset of B∪F1 and that of B∪F2 is relatively large. It offers insights on "discriminating" patterns for a given condition B. Conditional contrast mining is related to frequent pattern mining and analysis in general, and to the mining and analysis of closed patterns and minimal generators in particular. It can also be viewed as a new direction for the analysis (and mining) of frequent patterns. After formalizing the concepts of conditional contrast, the chapter will provide some theoretical results on conditional contrast mining. These results (i) relate conditional contrasts with closed patterns and their minimal generators, (ii) provide a concise representation for conditional contrasts, and (iii) establish a so-called dominance-beam property. An efficient algorithm will be proposed based on these results, and experiment results will be reported. Related works will also be discussed. © 2009, IGI Global.
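The definition of a conditional contrast (B, F1, F2) can be made concrete with a small sketch. It assumes that both "differences" are measured as the size of a symmetric difference, which is only one possible choice; the chapter's actual measures are not given in the abstract, and the function names and toy transactions are hypothetical.

```python
def matching(dataset, pattern):
    """Indices of the transactions that contain every item of the pattern."""
    return {i for i, t in enumerate(dataset) if pattern <= t}


def conditional_contrast(dataset, base, f1, f2):
    """Return (pattern difference, matching-data difference) for (B, F1, F2).

    Both differences are symmetric-difference sizes (an assumed measure).
    """
    pattern_diff = len(f1 ^ f2)
    m1 = matching(dataset, base | f1)  # transactions matching B ∪ F1
    m2 = matching(dataset, base | f2)  # transactions matching B ∪ F2
    return pattern_diff, len(m1 ^ m2)


# Hypothetical toy transactions.
dataset = [
    {"a", "b", "x"},
    {"a", "b", "x"},
    {"a", "b", "y"},
    {"a", "c", "x"},
    {"b", "y"},
]
result = conditional_contrast(dataset, {"a", "b"}, {"x"}, {"y"})
```

A triple is interesting in the sense described above when the first number is small relative to the second.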
Feng, M, Li, J, Dong, G & Wong, L 2009, 'Maintenance of frequent patterns: A survey' in Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, pp. 273-293.
This chapter surveys the maintenance of frequent patterns in transaction datasets. It is written to be accessible to researchers familiar with the field of frequent pattern mining. The frequent pattern maintenance problem is summarized with a study on how the space of frequent patterns evolves in response to data updates. This chapter focuses on incremental and decremental maintenance. Four major types of maintenance algorithms are studied: Apriori-based, partition-based, prefix-tree-based, and concise-representation-based algorithms. The authors study the advantages and limitations of these algorithms from both the theoretical and experimental perspectives. Possible solutions to certain limitations are also proposed. In addition, some potential research opportunities and emerging trends in frequent pattern maintenance are discussed. © 2009, IGI Global.
Pedrycz, W 2005, 'Genetic search for logic structures in data' in Next Generation of Data-Mining Applications, pp. 631-661.
Abdollahi, M, Gao, X, Mei, Y, Ghosh, S & Li, J 2019, 'An Ontology-based Two-Stage Approach to Medical Text Classification with Feature Selection by Particle Swarm Optimisation', 2019 IEEE Congress on Evolutionary Computation, CEC 2019 - Proceedings, 2019 IEEE Congress on Evolutionary Computation, Institute of Electrical and Electronics Engineers, New Zealand, pp. 119-126.
© 2019 IEEE. Document classification (DC) is the task of assigning pre-defined labels to unseen documents by utilizing a model trained on the available labeled documents. DC has attracted much attention in medical fields recently because many issues can be formulated as a classification problem. It can assist doctors in decision making and correct decisions can reduce the medical expenses. Medical documents have special attributes that distinguish them from other texts and make them difficult to analyze. For example, many acronyms and abbreviations, and short expressions make it more challenging to extract information. The classification accuracy of the current medical DC methods is not satisfactory. The goal of this work is to enhance the input feature sets of the DC method to improve the accuracy. To approach this goal, a novel two-stage approach is proposed. In the first stage, a domain-specific dictionary, namely the Unified Medical Language System (UMLS), is employed to extract the key features belonging to the most relevant concepts such as diseases or symptoms. In the second stage, PSO is applied to select more related features from the extracted features in the first stage. The performance of the proposed approach is evaluated on the 2010 Informatics for Integrating Biology and the Bedside (i2b2) data set which is a widely used medical text dataset. The experimental results show substantial improvement by the proposed method on the accuracy of classification.
Abdollahi, M, Gao, X, Mei, Y, Ghosh, S & Li, J 2019, 'Stratifying Risk of Coronary Artery Disease Using Discriminative Knowledge-Guided Medical Concept Pairings from Clinical Notes', PRICAI 2019: Trends in Artificial Intelligence, Pacific Rim International Conference on Artificial Intelligence, Springer, Cuvu, Yanuca Island, Fiji, pp. 457-473.
© 2019, Springer Nature Switzerland AG. Document classification (DC) is one of the broadly investigated natural language processing tasks. Medical document classification can support doctors in making decisions and improve medical services. Since the data in document classification often appear in raw form, such as medical discharge notes, extracting meaningful information to use as features is a challenging task. Medical documents contain many specialized words and expressions, which make them more challenging to analyze. The classification accuracy of available methods in the medical field is not good enough. This work aims to improve the quality of the input feature sets to increase the accuracy. A new three-stage approach is proposed. In the first stage, the Unified Medical Language System (UMLS), which is a medical-specific dictionary, is used to extract the meaningful phrases by considering disease or symptom concepts. In the second stage, all the possible pairs of the extracted concepts are created as new features. In the third stage, Particle Swarm Optimisation (PSO) is employed to select features from the extracted and constructed features in the previous stages. The experimental results show that the proposed three-stage method achieved substantial improvement over the existing medical DC approaches.
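The second stage above (building every pair of extracted concepts as a new feature) can be sketched as follows. This is only an illustration: the concept strings, the `+` separator, and the function name are hypothetical, and neither the UMLS extraction of the first stage nor the PSO selection of the third stage is shown.

```python
from itertools import combinations


def pair_features(concepts):
    """Build one feature name per unordered pair of distinct concepts."""
    unique = sorted(set(concepts))  # drop repeats, fix a stable order
    return [f"{a}+{b}" for a, b in combinations(unique, 2)]


# Hypothetical concepts extracted from one clinical note.
note_concepts = ["chest pain", "hypertension", "diabetes", "hypertension"]
pairs = pair_features(note_concepts)
```

For n distinct concepts this yields n(n-1)/2 pair features, so the feature space grows quadratically, which is one reason a selection step would be needed afterwards.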
Zhang, X, Zhang, X, Verma, S, Liu, Y, Blumenstein, M & Li, J 2019, 'Detection of Anomalous Traffic Patterns and Insight Analysis from Bus Trajectory Data', PRICAI 2019: Trends in Artificial Intelligence, The 16th Pacific Rim International Conference on Artificial Intelligence, Cuvu, Fiji.
Abdollahi, M, Gao, X, Mei, Y, Ghosh, S & Li, J 2018, 'Uncovering discriminative knowledge-guided medical concepts for classifying coronary artery disease notes', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Australasian Joint Conference on Artificial Intelligence, SpringerLink, Wellington, New Zealand, pp. 104-110.
© Springer Nature Switzerland AG 2018. Text classification is the challenging task of allocating each document to the correct predefined class. Most of the time, there are irrelevant features which introduce noise in the learning step and reduce the precision of prediction. Hence, more efficient methods are needed to select or extract meaningful features to avoid noise and overfitting. In this work, an ontology-guided method utilizing the taxonomical structure of the Unified Medical Language System (UMLS) is proposed. This method extracts, as features, the concepts of phrases appearing in the documents that relate to diseases or symptoms. The efficiency of this method is evaluated on the 2010 Informatics for Integrating Biology and the Bedside (i2b2) data set. The experimental results show that the proposed ontology-based method significantly improves classification accuracy.
Li, J 2018, 'Version Space Completeness for Novel Hypothesis Induction in Biomedical Applications', Proceedings of the International Joint Conference on Neural Networks, International Joint Conference on Neural Networks, IEEE, Rio de Janeiro, Brazil.
© 2018 IEEE. Use of traditional discretization methods caused a heavy loss of hypotheses in the induction of version spaces. We present a new discretization method, named two-point discretization, to construct an interval covering all the positive data points of a variable as purely as possible. We prove that the two-point discretization is a necessary and sufficient condition to guarantee the completeness of version spaces (i.e., no loss of hypothesis). A linear complexity algorithm is proposed to implement these theories. The algorithm is also applied to real-world bioinformatics problems to induce significant biomedical hypotheses which have never been discovered by the traditional approaches.
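One way to read "an interval covering all the positive data points of a variable as purely as possible" is the tightest interval spanning the positive values, found in linear time. The sketch below follows that reading; it is an assumption rather than the paper's algorithm, and the function name, the purity measure, and the toy values are hypothetical.

```python
def two_point_interval(values, labels):
    """Return (low, high, purity) for the tightest interval covering all positives.

    `labels` holds truthy values for positive points. Purity is the fraction
    of points inside [low, high] that are positive (an assumed measure).
    """
    positives = [v for v, y in zip(values, labels) if y]
    if not positives:
        raise ValueError("at least one positive data point is required")
    low, high = min(positives), max(positives)  # the two cut points
    inside = [y for v, y in zip(values, labels) if low <= v <= high]
    purity = sum(1 for y in inside if y) / len(inside)
    return low, high, purity


# Hypothetical measurements of one variable with positive/negative labels.
values = [1.0, 2.0, 3.0, 4.0, 5.0]
labels = [0, 1, 0, 1, 0]
interval = two_point_interval(values, labels)
```

Any narrower interval would exclude a positive point, and any wider one could only admit more negative points, so under this reading the two cut points are simply the smallest and largest positive values.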
Li, J, Fong, S, Hu, S, Wong, RK & Mohammed, S 2017, 'Similarity majority under-sampling technique for easing imbalanced classification problem', Communications in Computer and Information Science, Australasian Conference on Data Mining, Springer, Melbourne, VIC, Australia, pp. 3-23.
© Springer Nature Singapore Pte Ltd. 2018. The imbalanced classification problem is an active topic in the fields of data mining, machine learning and pattern recognition. The imbalanced distributions of different class samples result in the classifier being over-fitted by learning too many majority class samples and under-fitted in recognizing minority class samples. Prior methods attempt to ease the imbalance problem through sampling techniques that re-assign and rebalance the distributions of the imbalanced dataset. In this paper, we propose a novel way to under-sample the majority class in order to adjust the original imbalanced class distributions. This method is called Similarity Majority Under-sampling Technique (SMUTE). By calculating the similarity of each majority class sample and observing its surrounding minority class samples, SMUTE effectively separates the majority and minority class samples to increase the recognition power for each class. The experimental results show that SMUTE could outperform the current under-sampling methods when the same under-sampling rate is used.
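The general idea of similarity-ranked under-sampling can be sketched as below. This is not the published SMUTE procedure: the abstract does not state the similarity measure or which majority samples are retained, so cosine similarity, the neighbourhood size `k`, and the `keep_most_similar` switch are all assumptions, and every name is hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_undersample(majority, minority, keep, k=1, keep_most_similar=True):
    """Keep `keep` majority samples, ranked by similarity to the minority class.

    Each majority sample is scored by the summed similarity to its k most
    similar minority samples. Which end of the ranking to keep is left as a
    parameter because the source does not specify it.
    """
    scored = []
    for idx, sample in enumerate(majority):
        sims = sorted((cosine(sample, m) for m in minority), reverse=True)[:k]
        scored.append((sum(sims), idx))
    scored.sort(key=lambda t: t[0], reverse=keep_most_similar)
    kept = sorted(idx for _, idx in scored[:keep])  # restore original order
    return [majority[i] for i in kept]


# Hypothetical two-dimensional samples.
majority = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
minority = [(1.0, 0.1)]
near = similarity_undersample(majority, minority, keep=2)
far = similarity_undersample(majority, minority, keep=2, keep_most_similar=False)
```

The `keep` argument plays the role of the under-sampling rate: it fixes how many majority samples survive, so different selection rules can be compared at the same rate.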
Rahman, JS, Li, J, Xie, J, Fogelman, S & Blumenstein, M 2018, 'Connectivity Based Method for Clustering Microbial Communities from Metagenomics Data of Water and Soil Samples', Proceedings of the International Joint Conference on Neural Networks, International Joint Conference on Neural Networks, IEEE, Rio de Janeiro, Brazil, pp. 1-8.
© 2018 IEEE. Understanding the microbial community structure of metagenomic water and soil samples is a key step in discovering the functions and impact of microorganisms on human and animal health. The evolution of Next Generation Sequencing (NGS) technology has encouraged researchers to sequence large quantities of microbial data from environmental sources. Clustering marker gene sequences into Operational Taxonomic Units (OTUs) is the most significant task in microbial community analysis. Several methods have been developed over the years to improve OTU picking strategies. However, building strongly connected OTUs is a major issue in the majority of these methods. Herein we present ConClust, a novel method for clustering OTUs that is based on quantifying connectivity among the sequences. Experimental analysis on two synthetic datasets and two real-world datasets from water and soil samples demonstrates that our method can mine robust OTUs. Our method can be highly beneficial for studying the functions of known and unknown microbes and analyzing their positive and negative effects on the environment as well as on human and animal health.
Lan, C, Peng, H, McGowan, EM, Hutvagner, G & Li, J 2018, 'An isomIR expression panel based novel breast cancer classification approach using improved mutual information', International Conference on Genome Informatics, Kunming, Yunnan, China.
Zhang, X, Liu, Y, Zheng, Y, Zhao, Z, Li, J & Liu, Y 2018, 'Distinction between Ships and Icebergs in SAR Images Using Ensemble Loss Trained Convolutional Neural Networks', AI 2018: Advances in Artificial Intelligence (LNAI), Australasian Joint Conference on Artificial Intelligence, Springer, Wellington, New Zealand, pp. 216-223.
With global warming, more new shipping routes will open and be used by more and more ships in the polar regions, particularly in the Arctic. Synthetic aperture radar (SAR) has been widely used in ship and iceberg monitoring for maritime surveillance and safety in Arctic waters. At present, compared with the detection of ships or icebergs, the task of distinguishing ships from icebergs in SAR images remains challenging. In this work, we propose a novel loss function called ensemble loss to train convolutional neural networks (CNNs), which is a convex function and incorporates the traits of cross entropy and hinge loss. The CNN model trained with the ensemble loss is evaluated on a real-world SAR data set for the distinction between ships and icebergs, where it achieves a higher classification accuracy of 90.15%. Experiments on another real image data set also confirm the effectiveness of the proposed ensemble loss.
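A loss that is convex and combines the traits of cross entropy and hinge loss can be illustrated with a weighted sum of the two for a binary problem. The exact form of the paper's ensemble loss is not given in the abstract, so the weighted-sum form, the weight `alpha`, and the function names below are assumptions. Labels are taken as +1 or -1 and `score` is a raw classifier output.

```python
from math import exp, log1p


def logistic_loss(margin):
    """Binary cross entropy written on the margin: log(1 + exp(-margin))."""
    if margin >= 0:
        return log1p(exp(-margin))
    return -margin + log1p(exp(margin))  # stable form for negative margins


def hinge_loss(margin):
    """Hinge loss: zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)


def combined_loss(score, label, alpha=0.5):
    """Weighted sum of the two losses (assumed form); label is +1 or -1."""
    margin = label * score
    return alpha * logistic_loss(margin) + (1.0 - alpha) * hinge_loss(margin)


# A confident correct score costs little; a confident wrong score costs a lot.
loss_correct = combined_loss(5.0, +1)
loss_wrong = combined_loss(5.0, -1)
```

Because both terms are convex in the score and the weights are non-negative, the weighted sum is convex as well, which matches the convexity property mentioned above.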
Zheng, Y, Peng, H, Zhang, X, Gao, X & Li, J 2018, 'Predicting Drug Targets from Heterogeneous Spaces using Anchor Graph Hashing and Ensemble Learning', Proceedings of the International Joint Conference on Neural Networks, International Joint Conference on Neural Networks, IEEE, Rio de Janeiro, Brazil.
© 2018 IEEE. The in silico prediction of potential drug-target interactions is of critical importance in drug research. Existing computational methods have achieved remarkable prediction accuracy, but usually suffer from poor prediction efficiency due to computational problems. To improve the prediction efficiency, we propose to predict drug targets based on the integration of heterogeneous features with anchor graph hashing and ensemble learning. First, we encode each drug as a 5682-bit vector, and each target as a 4198-bit vector, using their respective heterogeneous features. Then, these vectors are embedded into a low-dimensional Hamming space using anchor graph hashing. Next, we append the hashing bits of a target to the hashing bits of a drug as a vector to represent the drug-target pair. Finally, vectors of positive samples composed of known drug-target pairs and randomly selected negative samples are used to train and evaluate the ensemble learning model. The performance of the proposed method is evaluated on simulative target prediction of 1094 drugs from DrugBank. Extensive comparison experiments demonstrate that the proposed method can achieve high prediction efficiency while preserving satisfactory accuracy. In fact, it is 99.3 times faster and only 0.001 less in AUC than the best literature method, 'Pairwise Kernel Method'.
Li, J, Fong, S, Hu, S, Chu, VW, Wong, RK, Mohammed, S & Dey, N 2017, 'Rare event prediction using similarity majority under-sampling technique', Communications in Computer and Information Science, pp. 23-39.
© Springer Nature Singapore Pte Ltd. 2017. In data mining it is not uncommon to be confronted by imbalanced classification problems in which interesting samples are rare. Having too many ordinary but too few rare samples as training data will mislead the classifier, which becomes over-fitted by learning too much from majority class samples and under-fitted, lacking recognition power for minority class samples. In this research work, a novel rebalancing technique is studied that under-samples (reduces by sampling) the majority class to ease the imbalanced class distribution without synthesizing extra training samples. This simple method is called Similarity Majority Under-Sampling Technique (SMUTE). By measuring the similarity between each majority class sample and its surrounding minority class samples, SMUTE effectively discriminates the majority and minority class samples while taking care not to change too much of the underlying non-linear mapping between the input variables and the target classes. Two experiments are conducted and reported in this paper: one is an extensive performance comparison of SMUTE with state-of-the-art methods using generated imbalanced data; the other is the use of real data representing a case of natural disaster prevention where accident samples are rare. SMUTE is found to compare favourably with the other methods in both cases.
Zheng, Y, Ghosh, S & Li, J 2017, 'An optimized drug similarity framework for side-effect prediction', Computing in Cardiology, Computing in Cardiology Conference, Rennes, France, pp. 1-4.
© 2017 IEEE Computer Society. All rights reserved. Drug side-effects are crucial issues in both the pre-market drug development process and post-market clinical applications. They contribute to one-third of drug failures and cause significant fatality and severe morbidity. Thus the early identification of potential drug side-effects is of great interest. Most existing methods essentially rely on directly leveraging a few drug similarities for side-effect predictions, ignoring the performance improvement offered by drug similarity integration and optimization. In this study, we propose an optimized drug similarity framework (ODSF) to improve the performance of side-effect predictions. First, this framework integrates four different drug similarities into a comprehensive similarity. Next, the comprehensive similarity is optimized via clustering and then enhanced by indirect drug similarity. Finally, the optimized drug similarity is employed for side-effect predictions. The performance of ODSF was evaluated on simulative side-effect predictions of 917 drugs from DrugBank. Extensive comparison experiments demonstrate that ODSF is able to capture drug features from diverse perspectives and that the prediction performance is significantly improved owing to the optimized drug similarity.
Zhou, Z, Xu, G, Zhu, W, Li, J & Zhang, W 2017, 'Structure embedding for knowledge base completion and analytics', International Joint Conference on Neural Networks IJCNN 2017, IEEE, Anchorage, Alaska, USA.
To explore the latent information of human knowledge, the analysis of Knowledge Bases (KBs) (e.g. WordNet, Freebase) is essential. Some previous KB element embedding frameworks are used for KB structure analysis and completion. These embedding frameworks use a low-dimensional vector space representation for the large number of entities and relations in a KB. Based on that, the vector space representation of entities and relations which are not contained in the KB can be measured. The embedding idea is reasonable, but the current embedding methods have some difficulty obtaining proper embeddings for KB elements. The embedding methods use entity-relation-entity triplets, contained in most current KBs, as training data to output the embedding representation of entities and relations. To measure the truth of a triplet (whether the knowledge represented by the triplet is true or false), some current embedding methods such as Structured Embedding (SE) project entity vectors into a subspace, but the meaning of such a subspace is not clear for knowledge reasoning. Some other embedding methods such as TransE use a simple linear vector transform to represent a relation (such as vector addition or subtraction), which cannot deal with the multiple-relation or multiple-entity matching problem. For example, there may be multiple relations between two entities, or multiple entities may have the same relation with one entity. Inspired by previous KB element structured embedding methods, we propose a new method, Bipartite Graph Network Structured Embedding (BGNSE). BGNSE combines the current KB embedding methods with the bipartite graph network model, which is widely used in many fields including image data compression and collaborative filtering. BGNSE embeds each entity-relation-entity KB triplet into a bipartite graph network structure model, represents each entity by one bipartite graph layer, and represents each relation by the link weight matrix of the bipartite graph network. Based on bipartite graph model, our proposed method has followi...
Chen, Q, Lan, C, Li, J, Chen, B, Wang, L & Zhang, C 2016, 'Depth-first search encoding of RNA substructures', Intelligent Computing Theories and Application, International Conference on Intelligent Computing (ICIC), Springer, Lanzhou, China, pp. 328-334.
© Springer International Publishing Switzerland 2016. RNA structural motifs are important in the RNA folding process. Traditional index-based and shape-based schemas are useful in modeling RNA secondary structures but ignore the structural discrepancies between individual RNA family members. Further, the in-depth analysis of underlying substructure patterns is underdeveloped owing to varied and unnormalized substructures. This prevents us from understanding RNA functions. This article proposes a DFS (depth-first search) encoding for RNA substructures. The results show that our methods are useful in modelling complex RNA secondary structures.
Ghosh, S, Nguyen, H & Li, J 2016, 'Predicting short-term ICU outcomes using a sequential contrast motif based classification framework', Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS, International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, Orlando, Florida, USA, pp. 5612-5615.
© 2016 IEEE. Critical ICU events like acute hypotension and septic shock are dangerous complications, leading to multiple organ failures and eventual death. Previously, pattern mining algorithms have been employed for extracting interesting rules in various clinical domains. However, the extracted rules are directly investigated by clinicians for diagnosing a disease. Towards this purpose, there is a need to develop advanced prediction models which integrate dynamic patterns to learn a patient's physiological condition. In this study, a sequential contrast patterns-based classification framework is presented for detecting critical patient events, like hypotension and septic shock. Initially, a set of sequential patterns is obtained by using a contrast mining algorithm. Later, these patterns undergo post-processing for conversion to two novel representations: (1) a frequency-based feature space and (2) ordered sequences of patterns, which conserve the positional information of a pattern in a time series sequence. Each of these representations is automatically used for developing classification models using SVM and HMM methods. Our results on hypotension and septic shock datasets from a large-scale ICU database demonstrate better predictive capabilities when sequential patterns are used as features.
Ghosh, S, Zheng, Y, Lammers, T, Chen, YY, Fitzmaurice, C, Johnston, S & Li, J 2016, 'Deriving public sector workforce insights: A case study using Australian public sector employment profiles', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), International Conference on Advanced Data Mining and Applications, Springer, Gold Coast, Queensland, Australia, pp. 764-774.
© Springer International Publishing AG 2016. Effective approaches for measuring human capital in public sector and government agencies are essential for robust workforce planning against changing economic conditions. To this purpose, adopting innovative hypothesis-driven workforce data analysis can help discover hidden patterns and trends about the workforce. These trends are useful for decision making and support the development of policies to reach desired employment outcomes. In this study, the data challenges and approaches to a real life workforce analytics scenario are described. Statistical results from numerous workforce data experiments are combined to derive three hypotheses that are useful to public sector organisations for human resources management and decision making.
Li, J, Fong, S, Yuan, M & Wong, RK 2016, 'Adaptive multi-objective swarm crossover optimization for imbalanced data classification', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 374-390.
© Springer International Publishing AG 2016. Training a classifier with an imbalanced dataset, where there are more data from the majority class than the minority class, is a known problem in the data mining research community. The resultant classifier would become under-fitted in recognizing test instances of the minority class and over-fitted with overwhelming mediocre samples from the majority class. Many existing techniques have been tried, ranging from artificially boosting the amount of the minority class training samples such as SMOTE, downsizing the volume of the majority class samples, to modifying the classification induction algorithm in favour of the minority class. However, finding the optimal ratio between the samples from the majority and minority classes for building a classifier that has the best accuracy is tricky, due to the non-linear relationships between the attributes and the class labels. Merely rebalancing the sample sizes of the two classes to exact portions will often not produce the best result. A brute-force attempt to search for the perfect combination of majority/minority class samples for the best classification result is NP-hard. In this paper, a unified preprocessing approach is proposed, using stochastic swarm heuristics to cooperatively optimize the mixtures from the two classes by progressively rebuilding the training dataset. Our novel approach is shown to outperform the existing popular methods.
Liu, Q, Li, J, Wong, L & Ramamohanarao, K 2016, 'Efficient mining of pan-correlation patterns from time course data', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), International Conference on Advanced Data Mining and Applications, Springer, Gold Coast, Queensland, Australia, pp. 234-249.
© Springer International Publishing AG 2016. There are different types of correlation patterns between the variables of a time course data set, such as positive correlations, negative correlations, time-lagged correlations, and those correlations containing small interrupted gaps. Usually, these correlations are maintained only on a subset of time points rather than on the whole span of the time points which are traditionally required for correlation definition. As these types of patterns underline different trends of data movement, mining all of them is an important step to gain a broad insight into the dependencies of the variables. In this work, we prove that these diverse types of correlation patterns can be all represented by a generalized form of positive correlation patterns. We also prove a correspondence between positive correlation patterns and sequential patterns. We then present an efficient single-scan algorithm for mining all of these types of correlations. This “pan-correlation” mining algorithm is evaluated on synthetic time course data sets, as well as on yeast cell cycle gene expression data sets. The results indicate that: (i) our mining algorithm has linear time increment in terms of increasing number of variables; (ii) negative correlation patterns are abundant in real-world data sets; and (iii) correlation patterns with time lags and gaps are also abundant. Existing methods have only discovered incomplete forms of many of these patterns, and have missed some important patterns completely.
Xie, J, Wang, M, Zhou, Y & Li, J 2016, 'Coordinating discernibility and independence scores of variables in a 2D space for efficient and accurate feature selection', 12th Intelligent Computing Methodologies Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Intelligent Computing Methodologies (ICIC), Springer, Lanzhou, China, pp. 116-127.
© Springer International Publishing Switzerland 2016. Feature selection removes redundant and irrelevant features from the original features of exemplars, so that a sparse and representative feature subset can be detected for building a more efficient and accurate classifier. This paper presents a novel definition for the discernibility and independence scores of a feature, and then constructs a two dimensional (2D) space with the feature’s independence as y-axis and discernibility as x-axis to rank features’ importance. This new method is named FSDI (Feature Selection based on Discernibility and Independence of a feature). The discernibility score of a feature measures how well the feature distinguishes instances from different classes. The independence score measures the redundancy of a feature. All features are plotted in the 2D space according to their discernibility and independence coordinates. The area of the rectangle corresponding to a feature’s discernibility and independence in the 2D space is used as a criterion to rank the importance of the features. Top-k features with much higher importance than the rest are selected to form the sparse and representative feature subset for building an efficient and accurate classifier. Experimental results on 5 classical gene expression datasets demonstrate that our proposed FSDI algorithm can select the gene subset efficiently and has the best performance in classification. Our method provides a good solution to the bottleneck issues related to the high time complexity of the existing gene subset selection algorithms.
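The area-based ranking described in this abstract can be illustrated with a minimal sketch. It assumes the discernibility and independence scores have already been computed (the abstract does not give their definitions), so the feature names, the score values and the helper names `rank_by_area` and `top_k` are hypothetical and not taken from the paper.

```python
def rank_by_area(scores):
    """Rank features by the area of the rectangle spanned by their two scores.

    scores maps a feature name to a (discernibility, independence) pair.
    The score values are assumed inputs; how FSDI derives them is not
    stated in the abstract.
    """
    return sorted(scores, key=lambda f: scores[f][0] * scores[f][1], reverse=True)


def top_k(scores, k):
    """Return the k features with the largest rectangle area."""
    return rank_by_area(scores)[:k]


# Hypothetical scores: g2 is highly discernible but redundant (low independence).
example_scores = {
    "g1": (0.90, 0.80),
    "g2": (0.95, 0.10),
    "g3": (0.40, 0.90),
    "g4": (0.20, 0.20),
}
selected = top_k(example_scores, 2)
```

In this toy input, g2 has the highest discernibility but is ranked below g1 and g3 because its small independence score shrinks its rectangle, which is the trade-off the 2D space is meant to capture.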
Zheng, Y, Lan, C, Peng, H & Li, J 2016, 'Using Constrained Information Entropy to Detect Rare Adverse Drug Reactions from Medical Forums', 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, pp. 2460-2463.
Detection of adverse drug reactions (ADRs) is critical to avoid malpractice, yet it is challenging due to the uncertainty in pre-marketing review and the underreporting in post-marketing surveillance. To overcome this predicament, social media based ADR detection methods have been proposed recently. However, existing studies are mostly co-occurrence based methods and face several issues, in particular leaving out rare ADRs and failing to distinguish irrelevant ADRs. In this work, we introduce a constrained information entropy (CIE) method to solve these problems. CIE first recognizes the drug-related adverse reactions using a predefined keyword dictionary and then captures high- and low-frequency (rare) ADRs by information entropy. Extensive experiments on a medical forums dataset demonstrate that CIE outperforms the state-of-the-art co-occurrence based methods, especially in rare ADR detection.
Wang, Z, Yang, Y, Chang, S, Li, J, Fong, S & Huang, TS 2015, 'A Joint Optimization Framework of Sparse Coding and Discriminative Clustering', PROCEEDINGS OF THE TWENTY-FOURTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI), 1st International Workshop on Social Influence Analysis / 24th International Joint Conference on Artificial Intelligence (IJCAI), IJCAI-INT JOINT CONF ARTIF INTELL, Buenos Aires, ARGENTINA, pp. 3932-3938.
Ghosh, S, Feng, M, Nguyen, H & Li, J 2014, 'Predicting Heart Beats using Co-occurring Constrained Sequential Patterns', http://www.cinc.org/archives/2014/, Computing in Cardiology, IEEE, Boston USA, pp. 265-268.
The aim of this study is to develop and evaluate a robust method for heart beat detection using a sequential pattern mining framework, based on the multi-modal PhysioNet 2014 challenge dataset. Each multi-modal patient time series was initially transformed to a symbolic sequence using Symbolic Aggregation Approximation (SAX). A training set was created by randomly selecting 70% of the data, and the remaining 30% was used as the test set. Later, all segments of length 100 were extracted for annotated beat occurrences. Subsequently, an algorithm was used to extract repetitive frequent subsequences, where consecutive symbols are separated by a pre-defined gap range. The patterns for ECG and BP were then ranked based on length and frequency support. For tests, the highest ranked patterns were used to mark beat segments. True beat occurrences were only considered when patterns co-occurred for both ECG and BP within a width of 150 time points. Our results comprise two parts, viz. extracted top ranked sequences and gross test statistics. An interpretive highest ranked sequential pattern for ECG looks like [7,7,7,5,5,5,5,5,4,3,10,10,10,2,2,3,3,4,3,4,5,5,5,6,7], for 10 discrete symbols which identify regional signal activity, with a gap range of [2,4] between contiguous elements. As per our test results, the method gives us a sensitivity of 51.66% and a positive predictivity (PPV) of 67.15%. The novelty of mining gap constrained co-occurring frequent sequential patterns lies in its ability to capture approximate co-occurring long clinical episodes across multiple variables, even if the quality of one signal suffers for a certain period of time. A higher PPV indicates that our method did not have a lot of false positives (detecting non-beats). The method is still being improved and will be further tested in the next stages of the Ph
Ghosh, S, Feng, M, Nguyen, H & Li, J 2014, 'Risk Prediction for Acute Hypotensive Patients by using Gap Constrained Sequential Contrast Patterns', http://knowledge.amia.org/56638-amia-1.1540970/t-004-1.1544972?qr=1, AMIA Annual Symposium, AMIA, Washington D.C., USA.
Li, J, Fong, S, Zhuang, Y & Khoury, R 2014, 'Hierarchical Classification in Text Mining for Sentiment Analysis', 2014 INTERNATIONAL CONFERENCE ON SOFT COMPUTING & MACHINE INTELLIGENCE ISCMI 2014, 2014 International Conference on Soft Computing & Machine Intelligence (ISCMI 2014), IEEE, New Delhi, INDIA, pp. 46-51.
Wei, W, Yin, J, Li, J & Cao, L 2014, 'Modelling Asymmetry and Tail Dependence among Multiple Variables by Using Partial Regular Vine', Proceedings of the 2014 SIAM International Conference on Data Mining, SIAM International Conference on Data Mining, SIAM, Philadelphia, USA, pp. 776-784.
Modeling high-dimensional dependence is widely studied to explore deep relations among multiple variables, which is particularly useful for financial risk assessment. Very often, strong restrictions are applied on a dependence structure by existing high-dimensional dependence models. These restrictions prevent the detection of sophisticated structures such as asymmetry and upper and lower tail dependence between multiple variables. The paper proposes a partial regular vine copula model to relax these restrictions. The new model employs partial correlation to construct the regular vine structure, which is algebraically independent. This model is also able to capture the asymmetric characteristics among multiple variables by using two-parameter copulas with flexible lower and upper tail dependence. Our method is tested on a cross-country stock market data set to analyse the asymmetry and tail dependence. The high prediction performance is examined by the Value at Risk, which is a commonly adopted evaluation measure in financial markets.
Fong, S, Zhuang, Y, Li, J & Khoury, R 2013, 'Sentiment Analysis of Online News using MALLET', 2013 INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL AND BUSINESS INTELLIGENCE (ISCBI), International Symposium on Computational and Business Intelligence (ISCBI), IEEE, New Delhi, INDIA, pp. 301-304.
Wei, W, Li, J, Cao, L, Sun, J, Liu, C & Li, M 2013, 'Optimal Allocation of High Dimensional Assets through Canonical Vines', Advances in Knowledge Discovery and Data Mining: 17th Pacific-Asia Conference, PAKDD 2013, Gold Coast, Australia, April 14-17, 2013, Proceedings, Part I, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Gold Coast, Australia, pp. 366-377.
Canonical Vine, Mean Variance Criterion, Financial Return.
Wei, W, Fan, X, Li, J & Cao, L 2012, 'Model the Complex Dependence Structures of Financial Variables by Using Canonical Vine', The 21st ACM International Conference on Information and Knowledge Management, ACM International Conference on Information and Knowledge Management, Springer, Maui, Hawaii, USA, pp. 1382-1391.
Financial variables such as asset returns in the massive market contain various hierarchical and horizontal relationships forming complicated dependence structures. Modeling and mining of these structures is challenging due to their own high structural complexities as well as the stylized facts of the market data. This paper introduces a new canonical vine dependence model to identify the asymmetric and non-linear dependence structures of asset returns without any prior independence assumptions. To simplify the model while maintaining its merit, a partial correlation based method is proposed to optimize the canonical vine. Compared with the original canonical vine, the new model can still maintain the most important dependence, but many unimportant nodes are removed to simplify the canonical vine structure. Our model is applied to construct and analyze dependence structures of European stocks as case studies. Its performance is evaluated by measuring the portfolio Value at Risk, a widely used risk management measure. In comparison to a very recent canonical vine model and the 'full' model, our experimental results demonstrate that our model has a much better quality of Value at Risk, providing insightful knowledge for investors to control and reduce the aggregation risk of the portfolio.
Li, J, Liu, Q & Zeng, T 2010, 'Negative correlations in collaboration: concepts and algorithms', Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, Washington DC, pp. 463-472.
Tang, M, Wang, W, Jiang, Y, Zhou, Y, Li, J, Cui, P, Liu, Y & Yan, B 2010, 'Birds Bring Flues? Mining Frequent and High Weighted Cliques from Birds Migration Networks', DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PT II, PROCEEDINGS, 15th International Conference on Database Systems for Advanced Applications, SPRINGER-VERLAG BERLIN, Tsukuba, JAPAN, pp. 359-+.
Zhu, L & Li, J 2010, 'Water bioinformatics: An association between estrogen degradation and 16S rRNA motifs', 2010 4th International Conference on Bioinformatics and Biomedical Engineering, iCBBE 2010.
The existence of estrogenic compounds in the water severely pollutes the ecological environment. It is believed that microorganisms such as harmless bacteria can be used as a clean and safe medium to naturally degrade the estrogens. Many bacteria have been found to be capable of degrading estrogens in different ways and at different speeds. However, the degradation mechanism, in particular the association between the degradation capability and their phylogenetic motifs, is still unknown. In this paper, we analyzed the 16S rRNA gene sequences of 17 kinds of bacteria, which are usually used for phylogenetic studies. We examined the association between motifs and degradation by identifying motifs that could separate those bacteria into several similar functional groups. Our computational result shows that the motifs have various positive associations with the degradation, implying that different biodegradation factors are in play. © 2010 IEEE.
Chen, P & Li, J 2009, 'Prediction of protein long-range contacts using GaMC approach with sequence profile centers', Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009, pp. 128-135.
In this paper, we apply an evolutionary optimization classifier, referred to as genetic algorithm-based multiple classifier (GaMC), to long-range contact prediction. As a result, about 44.1% of contacts between long-range residues (with a sequence separation of at least 24 amino acids) are found around the sequence profile (SP) center when evaluating the top L/5 (L is the sequence length of the protein) classified contacts if the SP centers are known. Meanwhile, with the knowledge of the sequence profile center and the GaMC method, about 20.42% of long-range contacts are correctly predicted. Results showed that the SP center may be a sound pathway to predict contact maps in protein structures. ©2009 IEEE.
Liu, Q, Chen, Y-PP & Li, J 2009, 'High Functional Coherence in k-Partite Protein Cliques of Protein Interaction Networks', 2009 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE, IEEE International Conference on Bioinformatics and Biomedicine (BIBMW 2009), IEEE COMPUTER SOC, Washington, DC, pp. 111-+.
Tang, M, Zhou, Y, Cui, P, Wang, W, Li, J, Zhang, H, Hou, Y & Yan, B 2009, 'Discovery of Migration Habitats and Routes of Wild Bird Species by Clustering and Association Analysis', ADVANCED DATA MINING AND APPLICATIONS, PROCEEDINGS, 5th International Conference on Advanced Data Mining and Applications, SPRINGER-VERLAG BERLIN, Beijing, PEOPLES R CHINA, pp. 288-+.
Zhao, L & Li, J 2009, 'Sequence-based B-cell epitope prediction by using associations in antibody-antigen structural complexes', Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009, pp. 165-172.
B-cell secreted antibodies play a critical role in fighting against invaders and abnormal self tissues. Identifying the epitope on antigens recognized by the paratope on antibodies can improve the understanding of this important immune mechanism. Predicting B-cell epitopes can also pave the way for vaccine design and disease therapy. However, due to the high complexity of this problem, previous prediction methods that focus on linear and conformational epitopes are both unsatisfactory. In this work, we propose a novel method to predict B-cell epitopes, when a pair of sequences is given, by using associations and cooperativity patterns from a relatively small antigen-antibody structural data set. More exactly, our classifier is trained on only PDB protein complexes, but it can be applied to any sequence data. Our evaluation results show that the accuracy of our method is very competitive with, and sometimes even much better than, previous structure-based prediction methods, which have a smaller applicability scope than ours. ©2009 IEEE.
Zhao, L & Li, J 2009, 'Sequence-based B-cell epitope prediction by using associations in antibody-antigen structural complexes', BIBMW: 2009 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE WORKSHOP, IEEE International Conference on Bioinformatics and Biomedicine (BIBMW 2009), IEEE, Washington, DC, pp. 163-170.
Feng, M, Li, J, Wong, L & Tan, Y-P 2008, 'Negative Generator Border for Effective Pattern Maintenance', ADVANCED DATA MINING AND APPLICATIONS, PROCEEDINGS, 4th International Conference on Advanced Data Mining and Applications, SPRINGER-VERLAG BERLIN, Chengdu, PEOPLES R CHINA, pp. 217-+.
Li, J, Sim, K, Liu, G & Wong, L 2008, 'Maximal Quasi-Bicliques with Balanced Noise Tolerance: Concepts and Co-clustering Applications', Proceedings of the 8th SIAM International Conference on Data Mining (SDM08), SIAM, Atlanta, pp. 72-83.
Liu, X, Li, J & Wang, L 2008, 'Quasi-bicliques: Complexity and Binding Pairs', Proceedings of the 14th Annual International Conference, COCOON 2008, Annual International Computing and Combinatorics Conference, Springer, Dalian, pp. 255-264.
Protein-protein interactions (PPIs) are one of the most important mechanisms in cellular processes. To model protein interaction sites, recent studies have suggested first finding interacting protein group pairs from large PPI networks, and then searching for conserved motifs within the protein groups to form interacting motif pairs. To account for the noise and incompleteness of biological data, we propose to use quasi-bicliques for finding interacting protein group pairs. We investigate two new problems which arise from finding interacting protein group pairs: the maximum vertex quasi-biclique problem and the maximum balanced quasi-biclique problem. We prove that both problems are NP-hard. This is a surprising result, as the widely known maximum vertex biclique problem is polynomial time solvable. We then propose a heuristic algorithm which uses the greedy method to find the quasi-bicliques from PPI networks. Our experimental results on real data show that this algorithm has a better performance than a benchmark algorithm for identifying highly matched BLOCKS and PRINTS motifs.
Lo, D, Khoo, S & Li, J 2008, 'Mining and Ranking Generators of Sequential Patterns', Proceedings of the 2008 SIAM International Conference on Data Mining, SIAM, Atlanta, pp. 553-564.
Lukman, S, Sim, K, Li, J & Chen, YPP 2008, 'Interacting amino acid preferences of 3D pattern pairs at the binding sites of transient and obligate protein complexes', Series on Advances in Bioinformatics and Computational Biology, pp. 69-78.
To assess the physico-chemical characteristics of protein-protein interactions, protein sequences and overall structural folds have been analyzed previously. To highlight this, discovery and examination of amino acid patterns at the binding sites defined by structural proximity in 3-dimensional (3D) space are essential. In this paper, we investigate the interacting preferences of 3D pattern pairs discovered separately in transient and obligate protein complexes. These 3D pattern pairs are not necessarily sequence-consecutive, but each residue in two groups of amino acids from two proteins in a complex is within a certain Å threshold to most residues in the other group. We develop an algorithm called AApairs by which every pair of interacting proteins is represented as a bipartite graph, and it discovers all maximal quasi-bicliques from every bipartite graph to form our 3D pattern pairs. From 112 and 2533 highly conserved 3D pattern pairs discovered in the transient and obligate complexes respectively, we observe that Ala and Leu are the most frequently occurring amino acids in interacting 3D patterns of transient (20.91%) and obligate (33.82%) complexes respectively. From the study on the dipeptide composition on each side of interacting 3D pattern pairs, dipeptides Ala-Ala and Ala-Leu are popular in 3D patterns of both transient and obligate complexes. The interactions between amino acids with large hydrophobicity difference are present more in the transient than in the obligate complexes. On the contrary, in obligate complexes, interactions between hydrophobic residues account for the top 5 most frequently occurring amino acid pairings.
Ben, N, Yang, Q, Li, J, Shin, C-K & Pal, S 2007, 'Discovering patterns of DNA methylation: Rule mining with rough sets and decision trees, and comethylation analysis', PATTERN RECOGNITION AND MACHINE INTELLIGENCE, PROCEEDINGS, 2nd International Conference on Pattern Recognition and Machine Intelligence, SPRINGER-VERLAG BERLIN, Calcutta, INDIA, pp. 389-+.
Feng, M, Dong, G, Li, J, Tan, Y-P & Wong, L 2007, 'Evolution and maintenance of frequent pattern space when transactions are removed', ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining, SPRINGER-VERLAG BERLIN, Nanjing, PEOPLES R CHINA, pp. 489-+.
Li, J & Hu, X 2007, 'Workshop BioDM'07 An overview', EMERGING TECHNOLOGIES IN KNOWLEDGE DISCOVERY AND DATA MINING, 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining, SPRINGER-VERLAG BERLIN, Nanjing, PEOPLES R CHINA, pp. 110-+.
Li, J, Liu, G & Wong, L 2007, 'Mining Statistically Important Equivalence Classes and Delta-Discriminative Emerging Patterns', KDD-2007 PROCEEDINGS OF THE THIRTEENTH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 13th International Conference on Knowledge Discovery and Data Mining, ASSOC COMPUTING MACHINERY, San Jose, CA, pp. 430-+.
Liu, G, Li, J, Sim, K & Wong, L 2007, 'Distance based subspace clustering with flexible dimension partitioning', Proceedings - International Conference on Data Engineering, pp. 1250-1254.
Traditional similarity or distance measurements usually become meaningless when the dimensions of the datasets increase, which has detrimental effects on clustering performance. In this paper, we propose a distance-based subspace clustering model, called nCluster, to find groups of objects that have similar values on subsets of dimensions. Instead of using a grid based approach to partition the data space into non-overlapping rectangle cells as in the density based subspace clustering algorithms, the nCluster model uses a more flexible method to partition the dimensions to preserve meaningful and significant clusters. We develop an efficient algorithm to mine only maximal nClusters. A set of experiments is conducted to show the efficiency of the proposed algorithm and the effectiveness of the new model in preserving significant clusters. © 2007 IEEE.
Liu, G, Li, J, Sim, K & Wong, L 2007, 'Distance based subspace clustering with flexible dimension partitioning', 2007 IEEE 23RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, IEEE 23rd International Conference on Data Engineering, IEEE, Istanbul, TURKEY, pp. 1225-+.
Vellaisamy, K & Li, J 2007, 'Multidimensional decision support indicator (mDSI) for time series stock trend prediction', ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining, SPRINGER-VERLAG BERLIN, Nanjing, PEOPLES R CHINA, pp. 841-+.
Chen, L, Bhowmick, SS & Li, J 2006, 'COWES: Clustering web users based on historical web sessions', DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PROCEEDINGS, 11th International Conference on Database Systems for Advanced Applications, SPRINGER-VERLAG BERLIN, Singapore, SINGAPORE, pp. 541-556.
Chen, L, Bhowmick, SS & Li, JY 2005, 'Mining temporal indirect associations', ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, SPRINGER-VERLAG BERLIN, Singapore, SINGAPORE, pp. 425-434.
Li, J, Li, H, Wong, L, Pei, J & Dong, G 2006, 'Minimum description length principle: Generators are preferable to closed patterns', Proceedings of the National Conference on Artificial Intelligence, pp. 409-414.
The generators and the unique closed pattern of an equivalence class of itemsets share a common set of transactions. The generators are the minimal ones among the equivalent itemsets, while the closed pattern is the maximum one. As a generator is usually smaller than the closed pattern in cardinality, by the Minimum Description Length Principle, the generator is preferable to the closed pattern in inductive inference and classification. To efficiently discover frequent generators from a large dataset, we develop a depth-first algorithm called Gr-growth. The idea is novel in contrast to traditional breadth-first bottom-up generator-mining algorithms. Our extensive performance study shows that Gr-growth is significantly faster (by one or even two orders of magnitude when the support thresholds are low) than the existing generator mining algorithms. It can also be faster than the state-of-the-art frequent closed itemset mining algorithms such as FPclose and CLOSET+. Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
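The relationship between generators and the closed pattern of an equivalence class can be made concrete on a toy dataset. The sketch below is a brute-force enumeration written only for illustration; it is not the Gr-growth algorithm, it scales exponentially with the number of items, and the function names and the example transactions are hypothetical.

```python
from itertools import combinations


def equivalence_classes(transactions):
    """Group itemsets by the set of transactions that contain them.

    Brute force over all itemsets; suitable only for tiny examples.
    Returns a dict: frozenset of transaction ids -> list of itemsets.
    """
    items = sorted(set().union(*transactions))
    classes = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            tids = frozenset(
                i for i, t in enumerate(transactions) if itemset <= t
            )
            if tids:  # ignore itemsets that occur in no transaction
                classes.setdefault(tids, []).append(itemset)
    return classes


def generators_and_closed(itemsets):
    """Split one equivalence class into its generators and closed pattern.

    Generators are the members with no proper subset in the class;
    the closed pattern is the single largest member.
    """
    closed = max(itemsets, key=len)
    generators = [s for s in itemsets if not any(o < s for o in itemsets)]
    return generators, closed


# Hypothetical toy transactions.
toy = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("c")]
toy_classes = equivalence_classes(toy)
toy_gens, toy_closed = generators_and_closed(toy_classes[frozenset({0, 1})])
```

Here the itemsets {a,c}, {b,c} and {a,b,c} all occur in exactly transactions 0 and 1, so they form one equivalence class whose generators are {a,c} and {b,c} and whose closed pattern is {a,b,c}. The generators are the shorter descriptions of the same transaction set, which is the sense in which the abstract prefers them under the Minimum Description Length Principle.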
Liu, G, Li, J, Wong, L & Hsu, W 2006, 'Positive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise', PROCEEDINGS OF THE SIXTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 6th SIAM International Conference on Data Mining, SIAM, Bethesda, MD, pp. 469-+.
Liu, G, Sim, K & Li, J 2006, 'Efficient mining of large maximal bicliques', DATA WAREHOUSING AND KNOWLEDGE DISCOVERY, PROCEEDINGS, 8th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2006), SPRINGER-VERLAG BERLIN, Cracow, POLAND, pp. 437-448.
Sim, K, Li, J, Gopalkrishnan, V & Liu, G 2006, 'Mining maximal Quasi-Bicliques to co-cluster stocks and financial ratios for value investment', ICDM 2006: SIXTH INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 6th IEEE International Conference on Data Mining, IEEE COMPUTER SOC, Hong Kong, PEOPLES R CHINA, pp. 1059-1063.
Vellaisamy, K & Li, J 2006, 'Bayesian approaches to ranking sequential patterns interestingness', PRICAI 2006: TRENDS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 9th Pacific Rim International Conference on Artificial Intelligence (PRICAI 2006), SPRINGER-VERLAG BERLIN, Guilin, PEOPLES R CHINA, pp. 241-250.
Dong, GZ, Jiang, CY, Pei, J, Li, JY & Wong, L 2005, 'Mining succinct systems of minimal generators of formal concepts', DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PROCEEDINGS, 10th International Conference on Database Systems for Advanced Applications (DASFAA 2005), SPRINGER-VERLAG BERLIN, Beijing, PEOPLES R CHINA, pp. 175-187.
Li, H, Li, J, Wong, L, Feng, M & Tan, YP 2005, 'Relative risk and odds ratio: A data mining perspective', Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 368-377.
We are often interested in testing whether a given cause has a given effect. If we cannot specify the nature of the factors involved, such tests are called model-free studies. There are two major strategies to demonstrate associations between risk factors (i.e. patterns) and outcome phenotypes (i.e. class labels). The first is that of prospective study designs, and the analysis is based on the concept of "relative risk": What fraction of the exposed (i.e. has the pattern) or unexposed (i.e. lacks the pattern) individuals have the phenotype (i.e. the class label)? The second is that of retrospective designs, and the analysis is based on the concept of "odds ratio": The odds that a case has been exposed to a risk factor are compared to the odds for a case that has not been exposed. The efficient extraction of patterns that have good relative risk and/or odds ratio has not been previously studied in the data mining context. In this paper, we investigate such patterns. We show that this pattern space can be systematically stratified into plateaus of convex spaces based on their support levels. Exploiting convexity, we formulate a number of sound and complete algorithms to extract the most general and the most specific of such patterns at each support level. We compare these algorithms. We further demonstrate that the most efficient among these algorithms is able to mine these sophisticated patterns at a speed comparable to that of mining frequent closed patterns, which are patterns that satisfy considerably simpler conditions. Copyright 2005 ACM.
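As a small illustration of the two measures named in this abstract, the sketch below counts a 2x2 contingency table for one pattern and computes relative risk and odds ratio using the standard textbook formulas, which are assumed here rather than taken from the paper. The function names and the toy transactions are hypothetical, and the mining algorithms of the paper are not reproduced.

```python
def contingency(transactions, labels, pattern):
    """Count the 2x2 table for a pattern (the exposure) and a class label.

    Returns (a, b, c, d):
      a = has pattern, has phenotype      b = has pattern, no phenotype
      c = lacks pattern, has phenotype    d = lacks pattern, no phenotype
    """
    pattern = set(pattern)
    a = b = c = d = 0
    for items, positive in zip(transactions, labels):
        exposed = pattern <= set(items)
        if exposed and positive:
            a += 1
        elif exposed:
            b += 1
        elif positive:
            c += 1
        else:
            d += 1
    return a, b, c, d


def relative_risk(a, b, c, d):
    """(a / (a + b)) / (c / (c + d)), rearranged to a single division.

    Zero cells are not handled; a ZeroDivisionError is raised if c == 0
    or if no transaction contains the pattern.
    """
    return (a * (c + d)) / (c * (a + b))


def odds_ratio(a, b, c, d):
    """(a / b) / (c / d) = (a * d) / (b * c); zero cells are not handled."""
    return (a * d) / (b * c)


# Hypothetical toy data: six transactions with a binary class label.
toy_tx = [{"x", "y"}, {"x"}, {"x", "z"}, {"y"}, {"z"}, {"y", "z"}]
toy_labels = [True, True, False, True, False, False]
table = contingency(toy_tx, toy_labels, {"x"})
rr = relative_risk(*table)
oratio = odds_ratio(*table)
```

For the pattern {x}, two of the three transactions containing it carry the label, against one of the three that lack it, so the relative risk is 2 and the odds ratio is 4.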
Li, JY, Du, JM, Fu, XY & Li, BR 2005, 'Pressure and vacuum servo control system based on vacuum pump and study of the system control', Proceedings of the Sixth International Conference on Fluid Power Transmission and Control, 6th International Conference on Fluid Power Transmission and Control (ICFP 2005), INTERNATIONAL ACADEMIC PUBLISHERS LTD, Zhejiang Univ, Hangzhou, PEOPLES R CHINA, pp. 368-372.
Li, JY, Li, HQ, Soh, D & Wong, L 2005, 'A correspondence between maximal complete bipartite subgraphs and closed patterns', KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005, 16th European Conference on Machine Learning (ECML)/9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), SPRINGER-VERLAG BERLIN, Oporto, PORTUGAL, pp. 146-156.
Li, JY, Liu, HQ & Li, L 2005, 'Diagnostic rules induced by an ensemble method for childhood leukemia', BIBE 2005: 5th IEEE Symposium on Bioinformatics and Bioengineering, 5th IEEE Symposium on Bioinformatics and Bioengineering, IEEE COMPUTER SOC, Minneapolis, MN, pp. 246-249.
Li, J, Tu, ZY & Blum, RS 2004, 'Slepian-Wolf coding for nonuniform sources using turbo codes', DCC 2004: DATA COMPRESSION CONFERENCE, PROCEEDINGS, Data Compression Conference (DCC 2004), IEEE COMPUTER SOC, Snowbird, UT, pp. 312-321.
Li, JY & Ramamohanarao, K 2004, 'A tree-based approach to the discovery of diagnostic biomarkers for ovarian cancer', ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 8th Pacific/Asia Conference on Advances in Knowledge Discovery and Data Mining, SPRINGER-VERLAG BERLIN, Sydney, AUSTRALIA, pp. 682-691.
Liu, H, Li, J & Wong, L 2004, 'Selection of patient samples and genes for outcome prediction', Proceedings - 2004 IEEE Computational Systems Bioinformatics Conference, CSB 2004, pp. 382-392.
Gene expression profiles with clinical outcome data enable monitoring of disease progression and prediction of patient survival at the molecular level. We present a new computational method for outcome prediction. Our idea is to use an informative subset of original training samples. This subset consists of only short-term survivors who died within a short period and long-term survivors who were still alive after a long follow-up time. These extreme training samples yield a clear platform to identify genes whose expression is related to survival. To find relevant genes, we combine two feature selection methods - entropy measure and Wilcoxon rank sum test - so that a set of sharply discriminating features is identified. The selected training samples and genes are then integrated by a support vector machine to build a prediction model, by which each validation sample is assigned a survival/relapse risk score for drawing Kaplan-Meier survival curves. We apply this method to two data sets: diffuse large-B-cell lymphoma (DLBCL) and primary lung adenocarcinoma. In both cases, patients in high and low risk groups stratified by our risk scores are clearly distinguishable. We also compare our risk scores to some clinical factors, such as International Prognostic Index score for DLBCL analysis and tumor stage information for lung adenocarcinoma. Our results indicate that gene expression profiles combined with carefully chosen learning algorithms can predict patient survival for certain diseases.
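The sample-selection step described in this abstract might be sketched as follows. This covers only the choice of extreme training samples, not the feature selection or the support vector machine. The `Patient` record, the function name and both cutoff parameters are hypothetical; the abstract does not give the thresholds the authors used.

```python
from typing import List, NamedTuple, Tuple


class Patient(NamedTuple):
    """Hypothetical record: identifier, follow-up time in years, death status."""
    patient_id: str
    followup_years: float
    died: bool


def select_extreme_samples(
    patients: List[Patient], short_cutoff: float, long_cutoff: float
) -> Tuple[List[Patient], List[Patient]]:
    """Keep only short-term and long-term survivors; drop everyone else."""
    # Short-term survivors: died within the (assumed) short cutoff.
    short_term = [p for p in patients
                  if p.died and p.followup_years <= short_cutoff]
    # Long-term survivors: still alive at or beyond the (assumed) long cutoff.
    long_term = [p for p in patients
                 if not p.died and p.followup_years >= long_cutoff]
    return short_term, long_term
```

Patients who fall between the two cutoffs, or who were censored early, are excluded from training under these assumptions; they would still be scored at validation time.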
Li, JY & Liu, HQ 2003, 'Ensembles of cascading trees', THIRD IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 3rd IEEE International Conference on Data Mining, IEEE COMPUTER SOC, MELBOURNE, FL, pp. 585-588.
Li, J & Wong, L 2002, 'Geography of differences between two classes of data', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 325-337.
Easily comprehensible ways of capturing main differences between two classes of data are investigated in this paper. In addition to examining individual differences, we also consider their neighbourhood. The new concepts are applied to three gene expression datasets to discover diagnostic gene groups. Based on the idea of prediction by collective likelihoods (PCL), a new method is proposed to classify testing samples. Its performance is competitive with several state-of-the-art algorithms. © 2002 Springer-Verlag Berlin Heidelberg.
Li, JY & Wong, L 2002, 'Solving the fragmentation problem of decision trees by discovering boundary emerging patterns', 2002 IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2nd IEEE International Conference on Data Mining, IEEE COMPUTER SOC, MAEBASHI CITY, JAPAN, pp. 653-656.
Li, J, Ramamohanarao, K & Dong, G 2001, 'Combining the strength of pattern frequency and distance for classification', Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science), pp. 455-466.
© Springer-Verlag Berlin Heidelberg 2001. Supervised classification involves many heuristics, including the ideas of decision tree, k-nearest neighbour (k-NN), pattern frequency, neural network, and Bayesian rule, on which induction algorithms are based. In this paper, we propose a new instance-based induction algorithm which combines the strength of pattern frequency and distance. We define a neighbourhood of a test instance. If the neighbourhood contains training data, we use k-NN to make decisions. Otherwise, we examine the support (frequency) of certain types of subsets of the test instance, and calculate support summations for prediction. This scheme is intended to deal with outliers: when no training data is near a test instance, the distance measure is not a proper predictor for classification. We present an effective method to choose an "optimal" neighbourhood factor for a given data set, using guidance from part of the training data. In this work, we find that our algorithm maintains (sometimes exceeds) the outstanding accuracy of k-NN on data sets containing purely continuous attributes, and that our algorithm greatly improves the accuracy of k-NN on data sets containing a mixture of continuous and categorical attributes. In general, our method is far superior to C5.0.
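The decision scheme outlined in this abstract (k-NN inside the neighbourhood, support summation outside it) can be sketched roughly as below. Several details are assumptions made for this example, not the paper's definitions: the neighbourhood is a Euclidean ball of fixed `radius`, each instance carries a separate set of categorical items, and the "certain types of subsets" are taken to be single items.

```python
import math
from collections import Counter
from typing import FrozenSet, List, Sequence, Tuple

# Assumed layout: (numeric features, categorical items, class label).
Instance = Tuple[Sequence[float], FrozenSet[str], str]


def classify(train: List[Instance], numeric: Sequence[float],
             items: FrozenSet[str], radius: float, k: int) -> str:
    """Predict a label: k-NN if neighbours exist, else support summation."""
    # Training points inside the neighbourhood, nearest first.
    near = sorted(
        (math.dist(x, numeric), label)
        for x, _, label in train
        if math.dist(x, numeric) <= radius
    )
    if near:
        # Majority vote among the (at most) k nearest neighbours.
        votes = Counter(label for _, label in near[:k])
        return votes.most_common(1)[0][0]

    # Fallback for outliers: for each class, sum the within-class
    # support (relative frequency) of every single item of the test instance.
    class_sizes = Counter(label for _, _, label in train)
    scores = {}
    for cls, size in class_sizes.items():
        scores[cls] = sum(
            sum(1 for _, its, label in train
                if label == cls and item in its) / size
            for item in items
        )
    return max(scores, key=scores.get)
```

In this sketch `radius` plays the role of the neighbourhood factor; the abstract says it is tuned on part of the training data, which is left out here.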
Li, J, Dong, G & Ramamohanarao, K 2000, 'Instance-Based Classification by Emerging Patterns', LECTURE NOTES IN COMPUTER SCIENCE, SPRINGER, pp. 191-200.
Li, JY, Ramamohanarao, K & Dong, GZ 2000, 'Emerging patterns and classification', ADVANCES IN COMPUTING SCIENCE-ASIAN 2000, PROCEEDINGS, 6th Asian Computing Science Conference, SPRINGER-VERLAG BERLIN, GEORGE TOWN, MALAYSIA, pp. 15-32.
Dong, GZ, Zhang, XZ, Wong, LS & Li, JY 1999, 'CAEP: Classification by aggregating emerging patterns', DISCOVERY SCIENCE, PROCEEDINGS, 2nd International Conference on Discovery Science (DS 99), SPRINGER-VERLAG BERLIN, TOKYO, JAPAN, pp. 30-42.
Li, JY, Zhang, XZ, Dong, GZ, Ramamohanarao, K & Sun, Q 1999, 'Efficient mining of high confidence association rules without support thresholds', PRINCIPLES OF DATA MINING AND KNOWLEDGE DISCOVERY, 3rd European Conference on Principles of Data Mining and Knowledge Discovery in Databases (PKDD 99), SPRINGER-VERLAG BERLIN, UNIV ECON, LAB INTELLIGENT SYST, PRAGUE, CZECH REPUBLIC, pp. 406-411.
Dong, GZ & Li, JY 1998, 'Interestingness of discovered association rules in terms of neighborhood-based unexpectedness', RESEARCH AND DEVELOPMENT IN KNOWLEDGE DISCOVERY AND DATA MINING, 2nd Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data Mining (PAKDD-98), SPRINGER-VERLAG BERLIN, MELBOURNE, AUSTRALIA, pp. 72-86.
Li, WC, Wang, YB, Li, WJ, Zhang, J & Li, JY 1998, 'Sparselized higher-order neural network and its pruning algorithm', IEEE WORLD CONGRESS ON COMPUTATIONAL INTELLIGENCE, 2nd IEEE World Congress on Computational Intelligence (WCCI 98), IEEE, ANCHORAGE, AK, pp. 359-362.