© 2019 The Author(s). Background: In silico prediction of potential drug side-effects is of crucial importance for drug development, since wet-lab identification of drug side-effects is expensive and time-consuming. Existing computational methods mainly leverage validated drug side-effect relations for the prediction, and their performance is severely impeded by the lack of reliable negative training data. Thus, a method to select reliable negative samples is vital for improving performance. Methods: Most existing computational prediction methods are based on the assumption that similar drugs are inclined to share the same side-effects, an assumption that has given rise to remarkable performance. It is also rational to assume the inverse proposition: dissimilar drugs are less likely to share the same side-effects. Based on this inverse similarity hypothesis, we propose a novel method to select highly reliable negative samples for side-effect prediction. The first step of our method is to build a drug similarity integration framework that measures the similarity between drugs from different perspectives. This step integrates drug chemical structures, drug target proteins, drug substituents, and drug therapeutic information as features into a unified framework. Then, a similarity score between each candidate negative drug and the validated positive drugs is calculated using the similarity integration framework. Candidate negative drugs with lower similarity scores are preferentially selected as negative samples. Finally, both the validated positive drugs and the selected highly reliable negative samples are used for prediction. Results: The performance of the proposed method was evaluated on simulative side-effect prediction of 917 DrugBank drugs, in comparison with four machine-learning algorithms. Extensive experiments show that the drug similarity integration framework has superior capability in capturing drug features, achieving mu...
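The selection step described in the abstract above (score each candidate against the validated positive drugs, then prefer the least similar candidates) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the similarity integration framework, so plain Jaccard similarity over feature sets stands in for it, and the drug names, feature identifiers, and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_negatives(candidates, positives, n):
    """Return the n candidate drugs least similar to the validated positive drugs.

    candidates, positives: dict mapping drug name -> set of feature identifiers.
    A candidate's score is its maximum Jaccard similarity to any positive drug
    (an assumed stand-in for the paper's integrated similarity score).
    Lower scores are selected first; ties are broken by name.
    """
    scored = []
    for name, feats in candidates.items():
        score = max(jaccard(feats, p) for p in positives.values())
        scored.append((score, name))
    scored.sort()
    return [name for _, name in scored[:n]]


# Hypothetical toy data: "t*" are targets, "s*" are substituents.
positives = {"drugA": {"t1", "t2", "s1"}, "drugB": {"t2", "s2"}}
candidates = {
    "drugX": {"t1", "t2"},        # shares two targets with drugA
    "drugY": {"t9", "s9"},        # shares nothing with the positives
    "drugZ": {"t2", "s7", "s8"},  # shares one target with each positive
}
print(select_negatives(candidates, positives, 2))  # ['drugY', 'drugZ']
```

Taking the maximum similarity over the positives is a conservative choice: a candidate is kept as a negative only if it is dissimilar to every drug known to cause the side-effect.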
Ho, N, Peng, H, Mayoh, C, Liu, PY, Atmadibrata, B, Marshall, GM, Li, J & Liu, T 2018, 'Delineation of the frequency and boundary of chromosomal copy number variations in paediatric neuroblastoma', Cell Cycle, vol. 17, no. 6, pp. 749-758.
Peng, H, Zheng, Y, Blumenstein, M, Tao, D & Li, J 2018, 'CRISPR/Cas9 cleavage efficiency regression through boosting algorithms and Markov sequence profiling', Bioinformatics, vol. 34, no. 18, pp. 3069-3077.
Peng, H, Zheng, Y, Zhao, Z, Liu, T & Li, J 2018, 'Recognition of CRISPR/Cas9 off-target sites through ensemble learning of uneven mismatch distributions', Bioinformatics, vol. 34, no. 17, pp. i757-i765.
Motivation: CRISPR/Cas9 is driving a broad range of innovative applications from basic biology to biotechnology and medicine. One of its current issues is off-target editing, which should be resolved and, ideally, avoided completely when the system is used. Results: We developed an ensemble learning method to detect the off-target sites of a single guide RNA (sgRNA) from its thousands of genome-wide candidates. Nucleotide mismatches between on-target and off-target sites have been studied recently. We confirm that there is strong mismatch enrichment and preference in the regions close to the 5'-end of the off-target sequences. Compared with the on-target sites, sequences of no-editing sites can also be characterized by GC composition changes and position-specific mismatch binary features. In this novel feature space, an ensemble strategy was applied to train a prediction model. The model achieved a mean Area Under the Receiver Operating Characteristic curve of 0.99 and a mean Area Under the Precision-Recall curve of 0.45 in cross-validations on big datasets, outperforming state-of-the-art methods in various test scenarios. Our predicted off-target sites also correspond very well to those detected by high-throughput sequencing techniques. In particular, two case studies on selecting sgRNAs to cure hearing loss and retinal degeneration partly demonstrate the effectiveness of our method. Availability and implementation: The Python and MATLAB versions of the source code for detecting off-target sites of a given sgRNA and the supplementary files are freely available at https://github.com/penn-hui/OfftargetPredict. Supplementary information: Supplementary data are available at Bioinformatics online.
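The two feature types named in the abstract above, position-specific mismatch binary features and GC composition changes, can be illustrated with a minimal sketch. The function names and the toy 8-nt sequences are hypothetical, and the exact feature definitions used in the paper and its repository may differ.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def mismatch_features(on_target, candidate):
    """Binary mismatch vector (1 = mismatch) followed by the GC-content change.

    Both sequences must have the same length; index 0 is the 5' end.
    """
    if len(on_target) != len(candidate):
        raise ValueError("sequences must have equal length")
    pairs = zip(on_target.upper(), candidate.upper())
    mismatches = [int(a != b) for a, b in pairs]
    gc_change = gc_content(candidate) - gc_content(on_target)
    return mismatches + [gc_change]


# Toy sequences (real sgRNA targets are longer).
features = mismatch_features("GACGTTAG", "TACGTCAG")
print(features[:-1])  # [1, 0, 0, 0, 0, 1, 0, 0]
print(features[-1])   # 0.0
```

A vector of this shape, one binary entry per position plus a composition term, is the kind of input an ensemble classifier could be trained on.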
Zhao, Z, Peng, H, Lan, C, Zheng, Y, Fang, L & Li, J 2018, 'Imbalance learning for the prediction of N-6-Methylation sites in mRNAs', BMC Genomics, vol. 19.
Zheng, Y, Peng, H, Zhang, X, Zhao, Z, Yin, J & Li, J 2018, 'Predicting adverse drug reactions of combined medication from heterogeneous pharmacologic databases', BMC Bioinformatics, vol. 19, no. Suppl 19, pp. 49-59.
BACKGROUND: Early and accurate identification of potential adverse drug reactions (ADRs) for combined medication is vital for public health. Existing methods rely either on expensive wet-lab experiments or on detecting existing associations from related records. Thus, they inevitably suffer from under-reporting, delays in reporting, and an inability to detect ADRs for new and rare drugs. The current application of machine learning methods is severely impeded by the lack of proper drug representation and credible negative samples. Therefore, a method to represent drugs properly and to select credible negative samples is vital in applying machine learning methods to this problem. RESULTS: In this work, we propose a machine learning method to predict ADRs of combined medication from pharmacologic databases by building up highly-credible negative samples (HCNS-ADR). Specifically, we first fuse heterogeneous information from different databases and represent each drug as a multi-dimensional vector according to its chemical substructures, target proteins, substituents, and related pathways. Then, a drug-pair vector is obtained by appending the vector of one drug to that of the other. Next, we construct a drug-disease-gene network and devise a scoring method to measure the interaction probability of every drug pair via network analysis. Drug pairs with lower interaction probability are preferentially selected as negative samples. Following that, the validated positive samples and the selected credible negative samples are projected into a lower-dimensional space using principal component analysis. Finally, a classifier is built for each ADR using its positive and negative samples with reduced dimensions. The performance of the proposed method is evaluated on simulative prediction for 1276 ADRs and 1048 drugs, using four machine learning algorithms and in comparison with two baseline approaches. Extensive experiments show that the proposed way to represent drugs characterizes drugs accu...
Lan, C, Peng, H, McGowan, EM, Hutvagner, G & Li, J 2018, 'An isomiR expression panel based novel breast cancer classification approach using improved mutual information', BMC Medical Genomics, vol. 11, no. Suppl 6, pp. 118-118.
BACKGROUND: Gene expression-based profiling has been used to identify biomarkers for different breast cancer subtypes. However, this technique has many limitations. IsomiRs are isoforms of miRNAs that have critical roles in many biological processes and have been successfully used to distinguish various cancer types. Biomarker isomiRs for identifying different breast cancer subtypes have not been investigated. For the first time, we aim to show that isomiRs are better performing biomarkers and use them to explain molecular differences between breast cancer subtypes. RESULTS: In this study, a novel method is proposed to identify specific isomiRs that faithfully classify breast cancer subtypes. First, as a null hypothesis method, we removed the lowly expressed isomiRs from small sequencing data generated from diverse breast cancer types. Second, we developed an improved mutual information-based feature selection method to calculate the weight of each isomiR expression. The weight of an isomiR measures its importance in classifying breast cancer subtypes. The improved mutual information can be applied to datasets in which the features are continuous and the labels are discrete, whereas traditional mutual information cannot be applied to such datasets. Finally, a support vector machine (SVM) classifier is applied to find isomiR biomarkers for subtyping. CONCLUSIONS: Here we demonstrate that isomiRs can be used as biomarkers in the identification of different breast cancer subtypes and, in addition, that they may provide new insights into the diverse molecular mechanisms of breast cancers. We have also shown that the classification of different subtypes of breast cancer based on isomiR expression is more effective than using published gene expression profiling. The proposed method performs better than the Fisher method and the Hellinger method for discovering biomarkers to distinguish different breast cancer subtypes. This novel techniqu...
The rapidly increasing number of genomes generated by high-throughput sequencing platforms and assembly algorithms is accompanied by problems in data storage, compression and communication. Traditional compression algorithms are unable to meet the demand for a high compression ratio because of intrinsically challenging features of DNA sequences such as small alphabet size, frequent repeats and palindromes. Reference-based lossless compression, by which only the differences between two similar genomes are stored, is a promising approach with a high compression ratio. We present a high-performance referential genome compression algorithm named HiRGC. It is based on a 2-bit encoding scheme and an advanced greedy-matching search on a hash table. We compare the performance of HiRGC with four state-of-the-art compression methods on a benchmark dataset of eight human genomes. HiRGC takes <30 min to compress about 21 gigabytes of each set of the seven target genomes into 96-260 megabytes, achieving compression ratios of 217 to 82 times. This performance is at least 1.9 times better than the best competing algorithm on its best case. Our compression speed is also at least 2.9 times faster. HiRGC is stable and robust across different reference genomes. In contrast, the competing methods' performance varies widely on different reference genomes. More experiments on 100 human genomes from the 1000 Genome Project and on genomes of several other species again demonstrate that HiRGC's performance is consistently excellent. The C++ and Java source codes of our algorithm are freely available for academic and non-commercial use. They can be downloaded from https://github.com/yuansliu/HiRGC. Supplementary data are available at Bioinformatics online.
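The two ingredients named in the abstract above, a 2-bit base encoding and a greedy match search over a hash table of reference k-mers, can be sketched in a few lines. This is a toy illustration of the general idea and not HiRGC itself: the function names, the tiny k, the operation format, and the handling of literals are all assumptions, and only the bases A, C, G and T are supported.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into one integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def unpack(value, length):
    """Inverse of pack: recover `length` bases from the packed integer."""
    shifts = range(2 * (length - 1), -1, -2)
    return "".join(BASES[(value >> s) & 3] for s in shifts)


def build_index(reference, k):
    """Hash table from each packed k-mer to the first position where it occurs."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(pack(reference[i:i + k]), i)
    return index


def compress(target, reference, k=4):
    """Greedily encode target as ('M', ref_pos, length) matches and ('L', base) literals."""
    index = build_index(reference, k)
    ops, i = [], 0
    while i < len(target):
        pos = index.get(pack(target[i:i + k])) if i + k <= len(target) else None
        if pos is None:
            ops.append(("L", target[i]))  # no k-mer seed here: store the base itself
            i += 1
            continue
        length = k
        # Extend the seed match for as long as target and reference agree.
        while (i + length < len(target) and pos + length < len(reference)
               and target[i + length] == reference[pos + length]):
            length += 1
        ops.append(("M", pos, length))
        i += length
    return ops


def decompress(ops, reference):
    """Rebuild the target from the operation list and the reference."""
    parts = []
    for op in ops:
        parts.append(reference[op[1]:op[1] + op[2]] if op[0] == "M" else op[1])
    return "".join(parts)


reference = "ACGTACGTTGCA"
target = "ACGTACGATGCA"  # differs from the reference at one position
ops = compress(target, reference)
print(ops)  # [('M', 0, 7), ('L', 'A'), ('M', 8, 4)]
print(decompress(ops, reference) == target)  # True
```

Only the match positions, match lengths and mismatching bases need to be stored, which is why a target genome that closely resembles the reference compresses so well.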
Peng, H, Lan, C, Liu, Y, Liu, T, Blumenstein, M & Li, J 2017, 'Chromosome preference of disease genes and vectorization for the prediction of non-coding disease genes', Oncotarget, vol. 8, no. 45, pp. 78901-78916.
Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector to represent diseases, and applies the newly vectorized data to a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. This novel vector representation for diseases consists of two sub-vectors. The first is composed of 45 elements, characterizing the information entropies of the distribution of disease genes over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein-coding genes, while some (e.g., the 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional, characterizing the distribution of disease gene enriched KEGG pathways in comparison with our manually created pathway groups. The second sub-vector complements the first one in differentiating between various diseases. Our prediction method outperforms the state-of-the-art methods on benchmark datasets for prioritizing disease-related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes.
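The abstract above does not give the formula behind the 45 entropy-based elements, so the following sketch shows only the underlying quantity: the Shannon entropy of a distribution of gene counts over chromosome substructures. The function name and the four-bin toy counts are hypothetical, and how the paper derives one element per substructure is not reproduced here.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution given by non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical counts of a disease's genes over four substructures.
print(shannon_entropy([8, 0, 0, 0]))  # 0.0 (all genes in one substructure)
print(shannon_entropy([4, 4, 0, 0]))  # 1.0
print(shannon_entropy([2, 2, 2, 2]))  # 2.0 (genes spread evenly)
```

Low entropy indicates that a disease's genes concentrate in a few substructures, which is the kind of chromosome preference the abstract describes.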
Peng, H, Lan, C, Zheng, Y, Hutvagner, G, Tao, D & Li, J 2017, 'Cross disease analysis of co-functional microRNA pairs on a reconstructed network of disease-gene-microRNA tripartite', BMC Bioinformatics, vol. 18, pp. 1-17.
MicroRNAs always function cooperatively in their regulation of gene expression. Dysfunctions of these co-functional microRNAs can play significant roles in disease development. We are interested in those multi-disease associated co-functional microRNAs that regulate their common dysfunctional target genes cooperatively in the development of multiple diseases. The research is potentially useful for human disease studies at the transcriptional level and for the study of multi-purpose microRNA therapeutics. We designed a computational method to detect multi-disease associated co-functional microRNA pairs and conducted cross disease analysis on a reconstructed disease-gene-microRNA (DGR) tripartite network. The DGR tripartite network is constructed by integrating newly predicted disease-microRNA associations with the relationships among diseases, microRNAs and genes maintained by existing databases. The prediction method uses a set of reliable negative samples of disease-microRNA association and a pre-computed kernel matrix instead of kernel functions. From this reconstructed DGR tripartite network, multi-disease associated co-functional microRNA pairs are detected together with their common dysfunctional target genes and ranked by a novel scoring method. We also conducted proof-of-concept case studies on cancer-related co-functional microRNA pairs as well as on non-cancer disease-related microRNA pairs. By prioritizing the co-functional microRNAs that relate to a series of diseases, we found that the co-function phenomenon is not unusual. We also confirmed that the regulation of microRNAs in the development of cancers is more complex and has more unique properties than that in non-cancer diseases.
Zheng, Y, Peng, H, Zhang, X, Gao, X & Li, J 2018, 'Predicting Drug Targets from Heterogeneous Spaces using Anchor Graph Hashing and Ensemble Learning', Proceedings of the International Joint Conference on Neural Networks, International Joint Conference on Neural Networks, IEEE, Rio de Janeiro, Brazil.
© 2018 IEEE. The in silico prediction of potential drug-target interactions is of critical importance in drug research. Existing computational methods have achieved remarkable prediction accuracy but usually suffer from poor prediction efficiency due to computational problems. To improve the prediction efficiency, we propose to predict drug targets based on the integration of heterogeneous features with anchor graph hashing and ensemble learning. First, we encode each drug as a 5682-bit vector and each target as a 4198-bit vector using their respective heterogeneous features. Then, these vectors are embedded into a low-dimensional Hamming space using anchor graph hashing. Next, we append the hashing bits of a target to the hashing bits of a drug to form a vector representing the drug-target pair. Finally, vectors of positive samples composed of known drug-target pairs and randomly selected negative samples are used to train and evaluate the ensemble learning model. The performance of the proposed method is evaluated on simulative target prediction of 1094 drugs from DrugBank. Extensive comparison experiments demonstrate that the proposed method can achieve high prediction efficiency while preserving satisfactory accuracy. In fact, it is 99.3 times faster and only 0.001 lower in AUC than the best literature method, 'Pairwise Kernel Method'.
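The pair representation described in the abstract above, appending a target's hashing bits to a drug's hashing bits, is simple to illustrate. The sketch below assumes the hash codes have already been computed; anchor graph hashing itself is not implemented, and the short bit vectors and function names are hypothetical.

```python
def pair_vector(drug_bits, target_bits):
    """Represent a drug-target pair by appending the target's bits to the drug's bits."""
    return drug_bits + target_bits


def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical hash codes (real codes would come from anchor graph hashing).
pair_1 = pair_vector([1, 0, 1, 1], [0, 0, 1])
pair_2 = pair_vector([1, 1, 1, 1], [0, 1, 1])
print(pair_1)                   # [1, 0, 1, 1, 0, 0, 1]
print(hamming(pair_1, pair_2))  # 2
```

Short binary codes of this kind keep the pair vectors compact, which is consistent with the efficiency gain the abstract reports.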
Lan, C, Peng, H, McGowan, EM, Hutvagner, G & Li, J 2018, 'An isomiR expression panel based novel breast cancer classification approach using improved mutual information', International Conference on Genome Informatics, Kunming, Yunnan, China.
Zheng, Y, Lan, C, Peng, H & Li, J 2016, 'Using Constrained Information Entropy to Detect Rare Adverse Drug Reactions from Medical Forums', 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, pp. 2460-2463.
Adverse drug reaction (ADR) detection is critical to avoid malpractice, yet it is challenging because of uncertainty in pre-marketing review and underreporting in post-marketing surveillance. To overcome this predicament, social media based ADR detection methods have been proposed recently. However, existing studies are mostly co-occurrence based and face several issues, in particular leaving out rare ADRs and being unable to distinguish irrelevant ADRs. In this work, we introduce a constrained information entropy (CIE) method to solve these problems. CIE first recognizes drug-related adverse reactions using a predefined keyword dictionary and then captures high- and low-frequency (rare) ADRs by information entropy. Extensive experiments on a medical forum dataset demonstrate that CIE outperforms the state-of-the-art co-occurrence based methods, especially in rare ADR detection.