Can supervise: YES
Gheisari, S, Catchpoole, DR, Charlton, A & Kennedy, PJ 2018, 'Convolutional Deep Belief Network with Feature Encoding for Classification of Neuroblastoma Histological Images', Journal of Pathology Informatics, vol. 9, pp. 1-28.
Neuroblastoma is the most common extracranial solid tumor in children younger than 5 years old. Optimal management of neuroblastic tumors depends on many factors including histopathological classification. The gold standard for classification of neuroblastoma histological images is visual microscopic assessment. In this study, we propose and evaluate a deep learning approach to classify high-resolution digital images of neuroblastoma histology into five different classes determined by the Shimada classification. We apply a combination of a convolutional deep belief network (CDBN) with a feature encoding algorithm that automatically classifies digital images of neuroblastoma histology into five different classes. We design a three-layer CDBN to extract high-level features from neuroblastoma histological images and combine it with a feature encoding model to extract features that are highly discriminative in the classification task. The extracted features are classified into five different classes using a support vector machine classifier. We constructed a dataset of 1043 neuroblastoma histological images derived from an Aperio scanner from 125 patients representing different classes of neuroblastoma tumors. A weighted average F-measure of 86.01% was obtained from the selected high-level features, outperforming state-of-the-art methods. The proposed computer-aided classification system, which uses the combination of deep architecture and feature encoding to learn high-level features, is highly effective in the classification of neuroblastoma histological images.
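As an illustration of the evaluation measure named in this abstract, the following is a minimal sketch of a weighted average F-measure, assuming the common definition in which each class's F1 score is weighted by its share of the true labels. The function name and the toy labels are hypothetical and are not taken from the paper.

```python
def weighted_f_measure(y_true, y_pred):
    """Average of per-class F1 scores, weighted by class support."""
    total = len(y_true)
    score = 0.0
    for c in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Support = number of true samples of class c.
        score += f1 * (tp + fn) / total
    return score


# Toy example with two hypothetical classes.
print(weighted_f_measure(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Weighting by support means that frequent classes contribute more to the final score than rare ones.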
Gheisari, S, Catchpoole, DR, Charlton, A, Melegh, Z, Gradhand, E & Kennedy, PJ 2018, 'Computer Aided Classification of Neuroblastoma Histological Images Using Scale Invariant Feature Transform with Feature Encoding', Diagnostics, vol. 8, no. 3, pp. 1-18.
Neuroblastoma is the most common extracranial solid malignancy in early childhood. Optimal management of neuroblastoma depends on many factors, including histopathological classification. Although histopathology study is considered the gold standard for classification of neuroblastoma histological images, computers can help to extract many more features, some of which may not be recognizable by human eyes. This paper proposes a combination of the Scale Invariant Feature Transform with a feature encoding algorithm to extract highly discriminative features. Then, distinctive image features are classified by a Support Vector Machine classifier into five clinically relevant classes. The advantage of our model is extracting features which are more robust to scale variation compared to the Patched Completed Local Binary Pattern and Completed Local Binary Pattern methods. We gathered a database of 1043 histologic images of neuroblastic tumours classified into five subtypes. Our approach identified features that outperformed the state-of-the-art on both our neuroblastoma dataset and a benchmark breast cancer dataset. Our method shows promise for classification of neuroblastoma histological images.
Narayan, N, Morenos, L, Phipson, B, Willis, SN, Brumatti, G, Eggers, S, Lalaoui, N, Brown, LM, Kosasih, HJ, Bartolo, RC, Zhou, L, Catchpoole, D, Saffery, R, Oshlack, A, Goodall, GJ & Ekert, PG 2017, 'Functionally distinct roles for different miR-155 expression levels through contrasting effects on gene expression, in acute myeloid leukaemia', Leukemia, vol. 31, no. 4, pp. 808-820.
Enforced expression of microRNA-155 (miR-155) in myeloid cells has been shown to have both oncogenic and tumour-suppressor functions in acute myeloid leukaemia (AML). We sought to resolve these contrasting effects of miR-155 overexpression using murine models of AML and human paediatric AML data sets. We show that the highest miR-155 expression levels inhibited proliferation in murine AML models. Over time, enforced miR-155 expression in AML in vitro and in vivo, however, favours selection of intermediate miR-155 expression levels that result in increased tumour burden in mice, without accelerating the onset of disease. Strikingly, we show that intermediate and high miR-155 expression also regulate very different subsets of miR-155 targets and have contrasting downstream effects on the transcriptional environments of AML cells, including genes involved in haematopoiesis and leukaemia. Furthermore, we show that elevated miR-155 expression detected in paediatric AML correlates with intermediate and not high miR-155 expression identified in our experimental models. These findings collectively describe a novel dose-dependent role for miR-155 in the regulation of AML, which may have important therapeutic implications.
Anaissi, A, Goyal, M, Catchpoole, DR, Braytee, A & Kennedy, PJ 2016, 'Ensemble Feature Learning of Genomic Data Using Support Vector Machine', PLoS ONE, vol. 11, no. 6, pp. 1-17.
The identification of a subset of genes having the ability to capture the necessary information to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively in the process of gene selection and classification. Testament to that is random forest, which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention, and mostly for classification rather than gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy which is the rationale of the RFE algorithm. The rationale is that building ensemble SVM models using randomly drawn bootstrap samples from the training set produces different feature rankings, which are subsequently aggregated into one feature ranking. As a result, the decision for elimination of features is based upon the ranking of multiple SVM models instead of choosing one particular model. Moreover, this approach addresses the problem of imbalanced datasets by constructing a nearly balanced bootstrap sample. Our experiments show that ESVM-RFE for gene selection substantially increased the classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that an average 9% better accuracy is achieved by ESVM-RFE over SVM-RFE, and 5% over a random forest based approach. The genes selected by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD), which reveals significant clusters with the selected data.
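The loop described in this abstract (balanced bootstrap samples, one feature ranking per model, aggregation of the rankings, backward elimination) can be sketched roughly as follows. This is not the authors' implementation: to stay dependency-free, the per-model ranking uses a simple stand-in score (absolute difference of class means) where the paper uses SVM weights, the bootstrap is exactly balanced rather than "nearly balanced", rankings are aggregated by summed rank, and labels are assumed to be 0/1. All function names are hypothetical.

```python
import random


def balanced_bootstrap(y, rng):
    """Draw the same number of sample indices, with replacement, from every class."""
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    per_class = min(len(members) for members in by_class.values())
    sample = []
    for members in by_class.values():
        sample.extend(rng.choice(members) for _ in range(per_class))
    return sample


def mean_difference_scores(X, y):
    """Stand-in scorer (assumption): |mean of class 1 - mean of class 0| per feature."""
    scores = []
    for j in range(len(X[0])):
        ones = [row[j] for row, label in zip(X, y) if label == 1]
        zeros = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(ones) / len(ones) - sum(zeros) / len(zeros)))
    return scores


def ensemble_rfe(X, y, n_keep, n_models=25, scorer=mean_difference_scores, seed=0):
    """Backward elimination driven by rankings aggregated over bootstrap models."""
    rng = random.Random(seed)
    active = list(range(len(X[0])))  # indices of features still in play
    while len(active) > n_keep:
        rank_sum = {f: 0 for f in active}
        for _ in range(n_models):
            idx = balanced_bootstrap(y, rng)
            Xb = [[X[i][f] for f in active] for i in idx]
            yb = [y[i] for i in idx]
            scores = scorer(Xb, yb)
            # Rank 0 is the best feature in this bootstrap model.
            order = sorted(range(len(active)), key=lambda p: scores[p], reverse=True)
            for rank, p in enumerate(order):
                rank_sum[active[p]] += rank
        # Eliminate the feature with the worst aggregated rank.
        active.remove(max(active, key=lambda f: rank_sum[f]))
    return active
```

Passing a different `scorer` (for example one that returns the absolute weights of a linear SVM trained on the bootstrap sample) would bring the sketch closer to the method the abstract describes.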
Anaissi, A, Goyal, M, Catchpoole, DR, Braytee, A & Kennedy, PJ 2015, 'Case-based retrieval framework for gene expression data', Cancer Informatics, vol. 14, pp. 21-31.
BACKGROUND: The process of retrieving similar cases in a case-based reasoning system is considered a big challenge for gene expression data sets. The huge number of gene expression values generated by microarray technology leads to complex data sets and similarity measures for high-dimensional data are problematic. Hence, gene expression similarity measurements require numerous machine-learning and data-mining techniques, such as feature selection and dimensionality reduction, to be incorporated into the retrieval process. METHODS: This article proposes a case-based retrieval framework that uses a k-nearest-neighbor classifier with a weighted-feature-based similarity to retrieve previously treated patients based on their gene expression profiles. RESULTS: The herein-proposed methodology is validated on several data sets: a childhood leukemia data set collected from The Children's Hospital at Westmead, as well as the Colon cancer, the National Cancer Institute (NCI), and the Prostate cancer data sets. Results obtained by the proposed framework in retrieving patients of the data sets who are similar to new patients are as follows: 96% accuracy on the childhood leukemia data set, 95% on the NCI data set, 93% on the Colon cancer data set, and 98% on the Prostate cancer data set. CONCLUSION: The designed case-based retrieval framework is an appropriate choice for retrieving previous patients who are similar to a new patient, on the basis of their gene expression data, for better diagnosis and treatment of childhood leukemia. Moreover, this framework can be applied to other gene expression data sets using some or all of its steps.
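A minimal sketch of retrieval by weighted-feature similarity, in the spirit of the framework described above. The abstract does not say how feature weights are obtained or which distance is used, so the weighted Euclidean distance, the function names and the toy profiles below are all assumptions.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance in which each feature (gene) has its own weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def retrieve_similar(query, cases, weights, k=3):
    """Return the ids of the k stored cases closest to the query profile.

    cases is a list of (case_id, expression_profile) tuples.
    """
    ranked = sorted(cases, key=lambda case: weighted_distance(query, case[1], weights))
    return [case_id for case_id, _ in ranked[:k]]


# Hypothetical two-gene profiles; the weights decide which gene drives similarity.
cases = [("p1", [1.0, 5.0]), ("p2", [1.2, 0.0]), ("p3", [4.0, 5.1])]
print(retrieve_similar([1.0, 5.0], cases, weights=[1.0, 0.0], k=2))
print(retrieve_similar([1.0, 5.0], cases, weights=[0.0, 1.0], k=2))
```

Setting a weight to zero removes that gene from the comparison, which is how feature selection can be folded into the retrieval step.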
Song, R, Catchpoole, DR, Kennedy, PJ & Li, J 2015, 'Identification of lung cancer miRNA-miRNA co-regulation networks through a progressive data refining approach', Journal of Theoretical Biology, vol. 380, pp. 271-279.
Ghous, H, Kennedy, PJ, Ho, N & Catchpoole, DR 2014, 'Comparing Functional Visualisations of Lists of Genes using Singular Value Decomposition', Journal of Research and Practice in Information Technology, vol. 47, no. 1, pp. 47-76.
Progress in understanding core pathways of cancer requires analysis of many genes. New insights are hampered due to the lack of tools to make sense of large lists of genes identified using high throughput technology. Data mining, particularly visualisation that finds relationships between genes and the Gene Ontology (GO), can assist in functional understanding. This paper addresses the question using GO annotations for functional understanding of genes. We augment genes with GO terms using two similarity measures: a Hop-based measure and an Information Content based measure, and visualise with Singular Value Decomposition (SVD). The results demonstrate that SVD visualisation of GO augmented genes matches the biological understanding expected in simulated and real-life data. Differences are observed in visualisation of GO terms, where the information content method produces more tightly-packed clusters than the hop-based method.
Anaissi, A, Kennedy, PJ, Goyal, M & Catchpoole, DR 2013, 'A balanced iterative random forest for gene selection from microarray data', BMC Bioinformatics, vol. 14, pp. 1-10.
The wealth of gene expression values being generated by high throughput microarray technologies leads to complex high dimensional datasets. Moreover, many cohorts have the problem of imbalanced classes where the number of patients belonging to each class is not the same. With this kind of dataset, biologists need to identify a small number of informative genes that can be used as biomarkers for a disease.
This paper introduces a Balanced Iterative Random Forest (BIRF) algorithm to select the most relevant genes for a disease from imbalanced high-throughput gene expression microarray data. Balanced iterative random forest is applied on four cancer microarray datasets: a childhood leukaemia dataset, which represents the main target of this paper, collected from The Children's Hospital at Westmead, NCI 60, a Colon dataset and a Lung cancer dataset. The results obtained by BIRF are compared to those of Support Vector Machine-Recursive Feature Elimination (SVM-RFE), Multi-class SVM-RFE (MSVM-RFE), Random Forest (RF) and Naive Bayes (NB) classifiers. The results of the BIRF approach outperform these state-of-the-art methods, especially in the case of imbalanced datasets. Experiments on the childhood leukaemia dataset show that 7% to 12% better accuracy is achieved by BIRF over MSVM-RFE with the ability to predict patients in the minor class. The informative biomarkers selected by the BIRF algorithm were validated by repeating training experiments three times to see whether they are globally informative, or just selected by chance. The results show that 64% of the top genes consistently appear in the three lists, and the top 20 genes remain near the top in the other three lists.
The designed BIRF algorithm is an appropriate choice to select genes from imbalanced high-throughput gene expression microarray data. BIRF outperforms the state-of-the-art methods, especially in its ability to handle class-imbalanced data. Moreover, the analysis...
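The stability check described in this abstract (repeating training and comparing the resulting gene lists) can be illustrated with a small overlap measure. The abstract does not define how the overlap was computed, so the definition below (fraction of the top-n genes common to every list) and the gene names are assumptions made for illustration only.

```python
def top_gene_overlap(rankings, top_n):
    """Fraction of the top_n genes that appear in the top_n of every ranking."""
    top_sets = [set(ranking[:top_n]) for ranking in rankings]
    common = set.intersection(*top_sets)
    return len(common) / top_n


# Three hypothetical gene rankings from repeated training runs.
runs = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g4", "g3"],
    ["g1", "g3", "g2", "g4"],
]
print(top_gene_overlap(runs, top_n=2))
```

A value close to 1 suggests the selected genes are stable across runs rather than chosen by chance.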
Tafavogh, S, Felix Navarro, KM, Catchpoole, DR & Kennedy, PJ 2013, 'Non-parametric and integrated framework for segmenting and counting neuroblastic cells within neuroblastoma tumor images', Medical & Biological Engineering & Computing, vol. 51, no. 6, pp. 645-665.
Neuroblastoma is a malignant childhood tumor that derives from the neural crest. The number of neuroblastic cells within the tumor provides significant prognostic information for pathologists. An enormous number of neuroblastic cells makes the process of counting tedious and error-prone. We propose a user interaction-independent framework that segments cellular regions, splits the overlapping cells and counts the total number of single neuroblastic cells. Our novel segmentation algorithm regards an image as a feature space constructed by joint spatial-intensity features of color pixels. It clusters the pixels within the feature space using mean-shift and then partitions the image into multiple tiles. We propose a novel color analysis approach to select the tiles with similar intensity to the cellular regions. The selected tiles contain a mixture of single and overlapping cells. We therefore also propose a cell counting method to analyse morphology of the cells and discriminate between overlapping and single cells. Ultimately, we apply watershed to split overlapping cells. The results have been evaluated by a pathologist. Our segmentation algorithm was compared against adaptive thresholding. Our cell counting algorithm was compared with two state-of-the-art algorithms. The overall cell counting accuracy of the system is 87.65%.
Ubaudi, FA, Kennedy, PJ, Catchpoole, DR, Guo, D & Simoff, SJ 2009, 'Microarray data mining: selecting trustworthy genes with gene feature ranking' in Data Mining for Business Applications, Springer, New York, USA, pp. 159-168.
Gene expression datasets used in biomedical data mining frequently have two characteristics: they have many thousand attributes but only relatively few sample points and the measurements are noisy. In other words, individual expression measurements may be untrustworthy. Gene Feature Ranking (GFR) is a feature selection methodology that addresses these domain specific characteristics by selecting features (i.e. genes) based on two criteria: (i) how well the gene can discriminate between classes of patient and (ii) the trustworthiness of the microarray data associated with the gene. An example from the pediatric cancer domain demonstrates the use of GFR and compares its performance with a feature selection method that does not explicitly address the trustworthiness of the underlying data.
Kennedy, PJ, Simoff, SJ, Catchpoole, DR, Skillicorn, D, Ubaudi, FA & Aloqaily, A 2008, 'Integrative visual data mining of biomedical data: Investigating cases in Chronic Fatigue Syndrome and Acute Lymphoblastic Leukaemia' in Simoff, SJ, Bohlen, MH & Mazeika, A (eds), Visual Data Mining: Theory, Techniques and Tools for Visual Analytics, Springer, Berlin Heidelberg, pp. 367-388.
This chapter presents an integrative visual data mining approach towards biomedical data. This approach and supporting methodology are presented at a high level. They combine in a consistent manner a set of visualisation and data mining techniques that operate over an integrated data set of several diverse components, including medical (clinical) data, patient outcome and interview data, corresponding gene expression and SNP data, domain ontologies and health management data. The practical application of the methodology and the specific data mining techniques engaged are demonstrated on two case studies focused on the biological mechanisms of two different types of diseases: Chronic Fatigue Syndrome and Acute Lymphoblastic Leukaemia, respectively. What the two cases have in common is the structure of the data sets.
Gheisari, S, Catchpoole, DR, Charlton, A & Kennedy, PJ 2018, 'Patched completed local binary pattern is an effective method for neuroblastoma histological image classification', Communications in Computer and Information Science, Australasian Conference on Data Mining, Bathurst, NSW, Australia, pp. 57-71.
Neuroblastoma is the most common extracranial solid tumour in children. The histology of neuroblastoma has high intra-class variation, which misleads existing computer-aided histological image classification methods that use global features. To tackle this problem, we propose a new Patched Completed Local Binary Pattern (PCLBP) method combining Sign Binary Pattern (SBP) and Magnitude Binary Pattern (MBP) within local patches to build feature vectors which are classified by k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM) classifiers. The advantage of our method is extracting local features which are more robust to intra-class variation compared to global ones. We gathered a database of 1043 histologic images of neuroblastic tumours classified into five subtypes. Our experiments show the proposed method improves the weighted average F-measure by 1.89% and 0.81% with k-NN and SVM classifiers, respectively.
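To make the sign and magnitude patterns concrete, here is a minimal sketch of how the two 8-bit codes can be computed for one pixel and collected into a per-patch feature vector. It follows the general completed local binary pattern idea (sign of the neighbour-minus-centre difference, and magnitude of that difference compared with a threshold). The 3x3 neighbourhood, the bit ordering, the externally supplied threshold and the concatenation of the two histograms are assumptions, not details confirmed by the abstract.

```python
# The 8 neighbours of the centre pixel, clockwise from the top-left corner.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def sign_magnitude_codes(image, row, col, magnitude_threshold):
    """Return (sign_code, magnitude_code) for the pixel at (row, col)."""
    centre = image[row][col]
    sign_code = 0
    magnitude_code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        diff = image[row + dr][col + dc] - centre
        if diff >= 0:  # sign pattern: neighbour at least as bright as centre
            sign_code |= 1 << bit
        if abs(diff) >= magnitude_threshold:  # magnitude pattern: large difference
            magnitude_code |= 1 << bit
    return sign_code, magnitude_code


def patch_feature_vector(patch, magnitude_threshold):
    """Concatenate the sign and magnitude histograms of one patch (512 bins)."""
    sign_hist = [0] * 256
    magnitude_hist = [0] * 256
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            s, m = sign_magnitude_codes(patch, r, c, magnitude_threshold)
            sign_hist[s] += 1
            magnitude_hist[m] += 1
    return sign_hist + magnitude_hist


# A single hypothetical 3x3 grey-level patch.
patch = [[5, 9, 1], [4, 6, 7], [2, 6, 3]]
print(sign_magnitude_codes(patch, 1, 1, magnitude_threshold=3))
```

Computing one such vector per local patch, rather than one for the whole image, is what makes the descriptor local.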
Roberts, AGK, Catchpoole, DR & Kennedy, PJ 2018, 'Variance-based Feature Selection for Classification of Cancer Subtypes Using Gene Expression Data', Proceedings of the International Joint Conference on Neural Networks, International Joint Conference on Neural Networks, IEEE, Rio de Janeiro, Brazil.
Classification in cancer has traditionally relied on feature selection by differential expression as a first step, where genes are selected according to the strength of evidence for a consistent difference in expression level between classes. However, recent work has shown that many genes also differ in the variance of their gene expression between disease states, and in particular between cancers of different types, prognosis, or stages of development. Features selected based on increased variance in cancer or differences in variance between tumours of differing prognosis have been used to successfully predict tumour progression or prognosis within the same cancer type, and to classify cancer subtypes in cases where there is an overall increase in variance in one class over the other. Here, we apply feature selection by differential variance to the more general problem of classification of cancer subtypes. We show that classifiers using features selected by differential variance are able to distinguish between clinically relevant cancer subtypes, that these classifiers perform as well as classifiers based on features selected by differential expression, and that combining the two approaches often gives better classification results than either feature selection method alone.
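A minimal sketch of feature selection by differential variance for two classes. The abstract does not state which test statistic the authors use, so the absolute log ratio of sample variances below is a stand-in chosen for simplicity; the function name and the toy expression values are hypothetical.

```python
import math
from statistics import variance


def differential_variance_ranking(expression, labels, n_select):
    """Rank genes by |log(var in class 0 / var in class 1)| and keep the top n_select.

    expression maps a gene name to one value per sample; labels holds 0 or 1 per sample.
    """
    eps = 1e-12  # avoids log(0) and division by zero for constant genes
    scores = {}
    for gene, values in expression.items():
        class0 = [v for v, label in zip(values, labels) if label == 0]
        class1 = [v for v, label in zip(values, labels) if label == 1]
        ratio = (variance(class0) + eps) / (variance(class1) + eps)
        scores[gene] = abs(math.log(ratio))
    return sorted(scores, key=scores.get, reverse=True)[:n_select]


labels = [0, 0, 0, 1, 1, 1]
expression = {
    "g_shift": [1, 2, 3, 11, 12, 13],      # mean differs, variance equal
    "g_var": [5.0, 5.1, 4.9, 1.0, 5.0, 9.0],  # mean equal, variance differs
    "g_flat": [1, 2, 3, 1, 2, 3],          # no difference
}
print(differential_variance_ranking(expression, labels, n_select=1))
```

In the toy data, `g_shift` would be found by differential expression but not by differential variance, while `g_var` shows the opposite behaviour, which is why combining the two criteria can help.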
Nguyen, Q, Lau, CW, Qu, Z, Simoff, SJ, Huang, M & Catchpoole, DR 2018, 'A Mobile Tool for Interactive Visualisation of Genomics Data', Proceedings of ITME 2018, International Conference on Information Technology in Medicine and Education, IEEE Computer Society CPS, Hangzhou, Zhejiang, China, pp. 688-697.
Advancement in genomic research and technology has significantly improved our understanding of biology, health, and medicine. Genomics data are very complex and contain genotype and phenotype information. Health researchers have long known that many diseases such as cancer are hereditary. Gaining insight and understanding of such data would enable a better understanding of the correlation between genes and diseases, which could facilitate personalised treatment for the patients. Visualisations have been increasingly used to break the complexity of genomics data to guide better decisions. Unfortunately, research on interactive visualisations of genomics data on mobile devices and immersive platforms is still limited. This paper presents a new interactive visualisation and navigation of genomics data on the mobile platform. The visualisation provides an overview of the entire patient cohort in a 3D similarity-space environment as well as 2D detail views of genes of interest. We introduce a new algorithm that enables effective touch-based interaction and exploration of a large number of items on small mobile screens. We illustrate the effectiveness of our platform through a childhood cancer dataset, B-cell Acute Lymphoblastic Leukaemia (ALL), as well as a pilot qualitative study with the domain experts.
Braytee, A, Liu, W, Catchpoole, DR & Kennedy, PJ 2017, 'Multi-label feature selection using correlation information', International Conference on Information and Knowledge Management, Proceedings, ACM on Conference on Information and Knowledge Management, ACM, Singapore, Singapore, pp. 1649-1656.
High-dimensional multi-labeled data contain instances, where each instance is associated with a set of class labels and has a large number of noisy and irrelevant features. Feature selection has been shown to have great benefits in improving the classification performance in machine learning. In multi-label learning, to select the discriminative features among multiple labels, several challenges should be considered: interdependent labels, different instances may share different label correlations, correlated features, and missing and flawed labels. This work is part of a project at The Children's Hospital at Westmead (TB-CHW), Australia to explore the genomics of childhood leukaemia. In this paper, we propose a CMFS (Correlated- and Multi-label Feature Selection method), based on non-negative matrix factorization (NMF) for simultaneously performing feature selection and addressing the aforementioned challenges. Significantly, a major advantage of our research is to exploit the correlation information contained in features, labels and instances to select the relevant features among multiple labels. Furthermore, ℓ2,1-norm regularization is incorporated in the objective function to undertake feature selection by imposing sparsity on the feature matrix rows. We employ CMFS to decompose the data and multi-label matrices into a low-dimensional space. To solve the objective function, an efficient iterative optimization algorithm is proposed with guaranteed convergence. Finally, extensive experiments are conducted on high-dimensional multi-labeled datasets. The experimental results demonstrate that our method significantly outperforms state-of-the-art multi-label feature selection methods.
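The row-sparsity regularizer mentioned in this abstract is the sum of the Euclidean norms of a matrix's rows. The short sketch below computes it and shows why it relates to feature selection: a feature whose whole row is driven to zero drops out. The helper names and the toy matrix are hypothetical, and this is only the norm itself, not the CMFS objective.

```python
import math


def l21_norm(matrix):
    """Sum over rows of each row's Euclidean (l2) norm."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)


def selected_rows(matrix, tol=1e-9):
    """Indices of rows (features) whose norm is not effectively zero."""
    return [i for i, row in enumerate(matrix) if math.sqrt(sum(v * v for v in row)) > tol]


# Toy feature matrix: the middle feature has been zeroed out entirely.
W = [[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]
print(l21_norm(W))       # 18.0
print(selected_rows(W))  # [0, 2]
```

Because the penalty acts on whole rows rather than individual entries, minimising it tends to remove entire features at once.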
Tafavogh, S, Meng, Q, Catchpoole, DR & Kennedy, PJ 2014, 'Automated quantitative and qualitative analysis of whole neuroblastoma tumour images for prognosis', Proceedings of the IASTED 11th International Conference on Biomedical Engineering, IASTED International Conference on Biomedical Engineering, ACTA Press, Zurich, Switzerland, pp. 244-251.
Tafavogh, S, Felix Navarro, KM, Catchpoole, DR & Kennedy, PJ 2013, 'Segmenting Neuroblastoma Tumor Images and Splitting Overlapping Cells Using Shortest Paths between Cell Contour Convex Regions', Lecture Notes in Computer Science, Artificial Intelligence in Medicine in Europe, Elsevier, Murcia, Spain, pp. 171-175.
Neuroblastoma is one of the most fatal paediatric cancers. One of the major prognostic factors for neuroblastoma tumours is the total number of neuroblastic cells. In this paper, we develop a fully automated system for counting the total number of neuroblastic cells within the images derived from Hematoxylin and Eosin stained histological slides by considering the overlapping cells. We propose a novel multi-stage cell counting algorithm, in which cellular regions are extracted using an adaptive thresholding technique. Overlapping and single cells are discriminated using morphological differences. We propose a novel cell splitting algorithm to split overlapping cells into single cells using the shortest path between contours of convex regions.
Ghous, H, Kennedy, PJ, Ho, N & Catchpoole, DR 2012, 'Functional Visualisation of Genes using Singular Value Decomposition', Proceedings of the 10th Australasian Data Mining Conference, Australian Data Mining Conference, Australian Computer Society, Sydney, Australia, pp. 53-60.
Progress in understanding core pathways and processes of cancer requires thorough analysis of many coding regions of the genome. New insights are hampered due to the lack of tools to make sense of large lists of genes identified using high throughput technology. Data mining, particularly visualisation that finds relationships between genes and the Gene Ontology (GO), has the potential to assist in functional understanding. This paper addresses the question of how well GO annotations can help in functional understanding of genes. We augment genes with associated GO terms and visualise with Singular Value Decomposition (SVD). Meaning of derived components is further interpreted using correlations to GO terms. The results demonstrate that SVD visualisation of GO-augmented genes matches the biological understanding expected in the simulated data and presents understanding of childhood cancer genes that aligns with published results.
Tafavogh, S, Kennedy, PJ & Catchpoole, DR 2012, 'Determining Cellularity Status of Tumors based on Histopathology using Hybrid Image Segmentation', International Joint Conference on Neural Networks, IEEE International Joint Conference on Neural Networks, IEEE, Brisbane, Australia, pp. 1-8.
A Computer Aided Diagnosis (CAD) system is developed to determine the cellularity status of a tumor. The system helps pathologists to distinguish a tumor with cell proliferation from normal tumors. The developed CAD system implements a hybrid segmentation method to identify and extract the morphological features that are used by pathologists for determining the cellularity status of a tumor. Adaptive Mean Shift (AMS) clustering, a non-parametric technique, is integrated with Color Template Matching (CTM) to construct the segmentation approach. We used Expectation Maximization (EM) clustering, a parametric technique, for comparison with our proposed approach. The output of our proposed system and EM are validated against the decisions of two pathologists as ground truth. The result of our developed system is quite close to the decision of the pathologists, and it significantly outperforms EM in terms of accuracy.
Ghous, H, Ho, N, Catchpoole, DR & Kennedy, PJ 2011, 'Comparing functional visualizations of genes', The 5th International Workshop on Data Mining in Functional Genomics and Proteomics: Current Trends and Future Directions, International Workshop on Data Mining in Functional Genomics and Proteomics: Current Trends and Future Directions, European Conference on Machine Learning, Athens, Greece, pp. 12-21.
Nguyen, Q, Gleeson, A, Ho, N, Huang, M, Simoff, SJ & Catchpoole, DR 2011, 'Visual Analytics of Clinical and Genetic Datasets of Acute Lymphoblastic Leukaemia', Lecture Notes in Computer Science (LNCS) 7062, International Conference on Neural Information Processing, Springer, Shanghai, China, pp. 113-120.
This paper presents a novel visual analytics method that incorporates knowledge from the analysis domain so that it can extract knowledge from complex genetic and clinical data and then visualize it in a meaningful and interpretable way. The domain experts, who both contribute to formulating the requirements for the design of the system and are its actual users, include microbiologists, biostatisticians, clinicians and computational biologists. A comprehensive prototype has been developed to support the visual analytics process. The system consists of multiple components enabling the complete analysis process, including data mining, interactive visualization, analytical views and gene comparison. A visual highlighting method is also implemented to support the decision making process. The paper demonstrates its effectiveness on a case study of childhood cancer patients.
Aloqaily, A, Kennedy, PJ, Catchpoole, DR & Simoff, SJ 2008, 'Comparison of visualization methods of genome-wide SNP profiles in childhood acute lymphoblastic leukemia', Data Mining and Analytics 2008: Proceedings of the Seventh Australasian Data Mining Conference (AusDM 2008), Conferences in Research and Practice in IT (CRPIT), Vol. 87, Australian Data Mining Conference, Australian Computer Society, Adelaide, Australia, pp. 111-121.
Data mining and knowledge discovery have been applied to datasets in various industries including biomedical data. Modelling, data mining and visualization in biomedical data address the problem of extracting knowledge from large and complex biomedical data. The current challenge of dealing with such data is to develop statistical-based and data mining methods that search and browse the underlying patterns within the data. In this paper, we employ several state-of-the-art data reduction methods for visualizing genome-wide Single Nucleotide Polymorphism (SNP) datasets. The visualization approach was selected based on the trustworthiness of the resultant visualizations. To deal with large amounts of genetic variation data, we have chosen to apply different data reduction methods to deal with the problem induced by high dimensionality. Based on the trustworthiness metric, we found that the Neighbour Retrieval Visualizer (NeRV) outperformed other methods. This method optimizes the retrieval quality of Stochastic Neighbour Embedding. The quality measure of the visualization (i.e. NeRV) showed excellent results, even though the dataset was reduced from 13917 to 2 dimensions. The visualization results will assist clinicians and biomedical researchers in understanding the systems biology of patients and how to compare different groups of clusters in visualizations.
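For readers unfamiliar with the trustworthiness metric used to compare the visualizations, the following is a minimal sketch assuming the commonly used Venna and Kaski definition: points that enter a point's k-neighbourhood in the low-dimensional view without being among its k nearest neighbours in the original space are penalised by how far away they originally were. The abstract does not spell out the formula, so this definition, the Euclidean distance and the function names are assumptions. The normalisation requires k < n/2.

```python
import math


def _neighbour_order(points, i):
    """Indices of all other points, sorted by Euclidean distance from point i."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))


def trustworthiness(original, embedded, k):
    """Trustworthiness of an embedding for neighbourhood size k (1.0 is best)."""
    n = len(original)
    penalty = 0
    for i in range(n):
        order_original = _neighbour_order(original, i)
        rank_original = {j: r for r, j in enumerate(order_original, start=1)}
        neighbours_original = set(order_original[:k])
        for j in _neighbour_order(embedded, i)[:k]:
            if j not in neighbours_original:
                # j looks close in the embedding but was not close originally.
                penalty += rank_original[j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


# Hypothetical 1-D data: the second embedding swaps two distant points.
original = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
scrambled = [[0.0], [5.0], [2.0], [3.0], [4.0], [1.0]]
print(trustworthiness(original, original, k=1))
print(trustworthiness(original, scrambled, k=1))
```

An embedding that preserves every neighbourhood scores 1.0, and the score falls as false neighbours appear in the visualization.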
Ghous, H, Kennedy, PJ, Catchpoole, DR & Simoff, SJ 2008, 'Kernel-based visualisation of genes with the Gene Ontology', Data Mining and Analytics 2008: Proceedings of the Seventh Australasian Data Mining Conference (AusDM 2008), Conferences in Research and Practice in IT (CRPIT), Vol. 87, Australian Data Mining Conference, Australian Computer Society, Adelaide, pp. 133-140.
Kennedy, PJ, Simoff, SJ, Skillicorn, D & Catchpoole, DR 2004, 'Extracting and explaining biological knowledge in microarray data', Advances In Knowledge Discovery And Data Mining, Proceedings, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer-Verlag Berlin, Sydney, Australia, pp. 699-703.
This paper describes a method of clustering lists of genes mined from a microarray dataset using functional information from the Gene Ontology. The method uses relationships between terms in the ontology both to build clusters and to extract meaningful cluster descriptions. The approach is general and may be applied to assist explanation of other datasets associated with ontologies.