Professor Kennedy has a PhD in Computing Science and joined UTS in 1999. He is Director of the Biomedical Data Science Laboratory in the UTS Centre for Artificial Intelligence. The Centre is a strategic investment area within the university and has been externally evaluated as one of the top two ranked research groups in the university. The mission of the Biomedical Data Science Laboratory is to use knowledge as the infrastructure to support decision making in biomedicine, most notably by assisting clinicians and biologists in cancer diagnosis and treatment.
Professor Kennedy is co-initiator of a research collaboration of more than 15 years with the Children's Hospital at Westmead, the University of Western Sydney and Queen's University, Canada. The larger project involves developing a point-of-care clinical software tool to better diagnose and treat children with leukaemia. The tool uses case-based reasoning to predict how paediatric cancer patients will respond to treatment by comparing them to previous patients on the basis of their systems biology.
See also my personal website.
Prof Kennedy has been involved with the Australasian Data Mining Conference (AusDM) since 2007 and has been co-editor of the AusDM proceedings many times since 2006. He has actively contributed to many international program committees and reviewed for international journals and books. He is an ARC Expert Assessor and has co-authored over 100 publications. He is a member of the industry body, the Institute of Analytics Professionals of Australia (IAPA). He has been awarded grants exceeding $1.4M.
Prof Kennedy has considerable industry experience developing software and running data mining projects. He spent 80% of 2008 in industry on data mining projects (including successfully bidding for and completing two worth $53,000) and has completed 10 projects since 2005 for major companies including Ford, Microsoft and News Ltd. A recent project mining networks of companies and directors formed the basis for articles written for the Australian Financial Review that were nominated for a Walkley Award.
Can supervise: YES
Prof Kennedy's research interests are in analytics of biomedical data. Since 2002 he has collaborated with paediatric cancer researchers to better understand and predict treatment outcomes for childhood cancer sufferers. He also explores other areas of data analytics and bioinformatics, including developing bioinformatics pipelines to facilitate animal vaccine discovery and mapping collaboration among researchers, as well as visualisation, text mining and social network analysis.
Prof Kennedy teaches data mining and software engineering classes at undergraduate and postgraduate levels. He usually teaches:
- Introduction to Data Analytics
- Fundamentals of Data Analytics
- Analytics Capstone Project B
- Advanced Data Analytics
- Advanced Data Analytics Algorithms
- Analytics Capstone Project
- and a guest lecture in Arguments, Evidence and Intuition
If you are interested in doing a project in data analytics or bioinformatics please email him: firstname.lastname@example.org
Cui, L, Wu, J, Pi, D, Zhang, P & Kennedy, P 2020, 'Dual Implicit Mining-Based Latent Friend Recommendation', IEEE Transactions on Systems, Man, and Cybernetics: Systems.
The latent friend recommendation problem in online social media is interesting yet challenging, because the user-item ratings and the user-user relationships are both sparse. In this paper, we propose a new dual implicit mining-based latent friend recommendation model that simultaneously considers the implicit interest topics of users and the implicit link relationships between the users in the local topic cliques. Specifically, we first propose an algorithm that uses all reviews from a user and all tags from their corresponding items to learn the implicit interest topics of the users and their corresponding topic weights, and then compute the user interest topic similarity using a symmetric Jensen-Shannon divergence. After that, we adopt the proposed weighted local random walk with restart algorithm to analyze the implicit link relationships between the users in the local topic cliques and calculate the weighted link relationship similarity between the users. Combining the user interest topic similarity with the weighted link relationship similarity in a unified way, we obtain the final latent friend recommendation list. Experiments on real-world datasets demonstrate that the proposed method outperforms state-of-the-art latent friend recommendation methods under four different types of evaluation metrics.
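The symmetric Jensen-Shannon divergence used above to compare users' topic-weight distributions is straightforward to compute. A minimal sketch in plain Python (the `js_divergence` helper is hypothetical, not the authors' code):

```python
import math

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence (log base 2) between two
    discrete probability distributions, e.g. users' topic weights."""
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

identical = js_divergence([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])  # 0.0
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])             # 1.0
```

With base-2 logarithms the divergence lies in [0, 1], so 1 − JSD gives a bounded topic similarity between two users.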
Curiskis, SA, Drake, B, Osborn, TR & Kennedy, PJ 2020, 'An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit', Information Processing and Management, vol. 57, no. 2.
Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over data sets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf weights combined with embedding distance measures.
Braytee, A, Liu, W, Anaissi, A & Kennedy, PJ 2019, 'Correlated Multi-label Classification with Incomplete Label Space and Class Imbalance', ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 5.
Hesamian, MH, Jia, W, He, X & Kennedy, P 2019, 'Deep Learning Techniques for Medical Image Segmentation: Achievements and Challenges', Journal of Digital Imaging, vol. 32, no. 4, pp. 582-596.
Deep learning is by now firmly established as a robust tool for image segmentation. It has been widely used to separate homogeneous areas as the first and critical component of the diagnosis and treatment pipeline. In this article, we present a critical appraisal of popular methods that have employed deep-learning techniques for medical image segmentation. Moreover, we summarize the most common challenges incurred and suggest possible solutions.
Qiao, C, Lu, L, Yang, L & Kennedy, PJ 2019, 'Identifying brain abnormalities with schizophrenia based on a hybrid feature selection technology', Applied Sciences (Switzerland), vol. 9, no. 10.
Many medical imaging datasets, especially magnetic resonance imaging (MRI) data, have a small sample size but a large number of features. Effectively reducing the data dimension and accurately locating biomarkers in such data are crucial for diagnosis and precision medicine. In this paper, we propose a hybrid feature selection method based on machine learning and traditional statistical approaches and explore the brain abnormalities of schizophrenia using functional and structural MRI data. The results show that the abnormal brain regions are mainly distributed in the supramarginal gyrus, cingulate gyrus, frontal gyrus, precuneus and caudate, and the abnormal functional connections are related to the caudate nucleus, insula and rolandic operculum. In addition, complex network analyses based on graph theory are applied to the functional connection data, and the results demonstrate that the located abnormal functional connections in the brain can distinguish schizophrenia patients from healthy controls. The abnormalities identified by the proposed hybrid feature selection method show that there exist abnormal brain regions and abnormal disruption of network segregation and network integration in schizophrenia; these changes may lead to inaccurate and inefficient information processing and synthesis in the brain, providing further evidence for the cognitive dysmetria of schizophrenia.
Gheisari, S, Catchpoole, DR, Charlton, A & Kennedy, PJ 2018, 'Convolutional Deep Belief Network with Feature Encoding for Classification of Neuroblastoma Histological Images', Journal of Pathology Informatics, vol. 9, pp. 1-28.
Neuroblastoma is the most common extracranial solid tumor in children younger than 5 years old. Optimal management of neuroblastic tumors depends on many factors including histopathological classification. The gold standard for classification of neuroblastoma histological images is visual microscopic assessment. In this study, we propose and evaluate a deep learning approach to classify high-resolution digital images of neuroblastoma histology into five different classes determined by the Shimada classification. We apply a combination of a convolutional deep belief network (CDBN) with a feature encoding algorithm that automatically classifies digital images of neuroblastoma histology into five different classes. We design a three-layer CDBN to extract high-level features from neuroblastoma histological images and combine it with a feature encoding model to extract features that are highly discriminative in the classification task. The extracted features are classified into five different classes using a support vector machine classifier. We constructed a dataset of 1043 neuroblastoma histological images, derived from an Aperio scanner, from 125 patients representing different classes of neuroblastoma tumors. A weighted average F-measure of 86.01% was obtained from the selected high-level features, outperforming state-of-the-art methods. The proposed computer-aided classification system, which uses the combination of deep architecture and feature encoding to learn high-level features, is highly effective in the classification of neuroblastoma histological images.
Gheisari, S, Catchpoole, DR, Charlton, A, Melegh, Z, Gradhand, E & Kennedy, PJ 2018, 'Computer Aided Classification of Neuroblastoma Histological Images Using Scale Invariant Feature Transform with Feature Encoding', Diagnostics, vol. 8, no. 3, pp. 1-18.
Neuroblastoma is the most common extracranial solid malignancy in early childhood. Optimal management of neuroblastoma depends on many factors, including histopathological classification. Although histopathology is considered the gold standard for classification of neuroblastoma histological images, computers can help to extract many more features, some of which may not be recognizable by human eyes. This paper proposes a combination of the Scale Invariant Feature Transform with a feature encoding algorithm to extract highly discriminative features. Distinctive image features are then classified by a Support Vector Machine classifier into five clinically relevant classes. The advantage of our model is extracting features which are more robust to scale variation compared to the Patched Completed Local Binary Pattern and Completed Local Binary Pattern methods. We gathered a database of 1043 histologic images of neuroblastic tumours classified into five subtypes. Our approach identified features that outperformed the state-of-the-art on both our neuroblastoma dataset and a benchmark breast cancer dataset. Our method shows promise for classification of neuroblastoma histological images.
Goodswen, SJ, Kennedy, PJ & Ellis, JT 2018, 'A Gene-Based Positive Selection Detection Approach to Identify Vaccine Candidates Using Toxoplasma gondii as a Test Case Protozoan Pathogen', Frontiers in Genetics, vol. 9.
Over the last two decades, various in silico approaches have been developed and refined that attempt to identify protein and/or peptide vaccine candidates from informative signals encoded in protein sequences of a target pathogen. To date, no signal has been identified that clearly indicates a protein will effectively contribute to a protective immune response in a host. The premise for this study is that proteins under positive selection from the immune system are more likely suitable vaccine candidates than proteins exposed to other selection pressures. Furthermore, our expectation is that protein sequence regions encoding major histocompatibility complex (MHC) binding peptides will contain consecutive positive selection sites. Using freely available data and bioinformatic tools, we present a high-throughput approach through a pipeline that predicts positive selection sites, protein subcellular locations, and sequence locations of medium to high T-cell MHC class I binding peptides. Positive selection sites are estimated from a sequence alignment by comparing rates of synonymous (dS) and non-synonymous (dN) substitutions among protein coding sequences of orthologous genes in a phylogeny. The main pipeline output is a list of protein vaccine candidates predicted to be naturally exposed to the immune system and containing sites under positive selection. Candidates are ranked with respect to the number of consecutive sites located on protein sequence regions encoding MHCI-binding peptides. Results are constrained by the reliability of prediction programs and quality of input data. Protein sequences from the Toxoplasma gondii ME49 strain (TGME49) were used as a case study. Surface antigen (SAG), dense granule (GRA), microneme (MIC), and rhoptry (ROP) proteins are considered worthy T. gondii candidates. Given 8263 TGME49 protein sequences processed anonymously, the top 10 predicted candidates were all worthy candidates. In particular, the top ten included ROP5 and...
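The positive-selection signal at the core of this pipeline is the per-site ratio of non-synonymous to synonymous substitution rates, omega = dN/dS, where omega > 1 suggests positive (diversifying) selection. A toy illustration in Python, assuming hypothetical per-site rate estimates rather than the pipeline's actual output:

```python
def positive_selection_sites(dn, ds, threshold=1.0):
    """Return indices of sites whose omega = dN/dS exceeds `threshold`,
    i.e. candidate positively selected sites."""
    return [i for i, (n, s) in enumerate(zip(dn, ds))
            if s > 0 and n / s > threshold]

# Hypothetical per-site substitution rates for a 5-codon alignment:
dn = [0.2, 1.5, 0.1, 2.0, 0.3]   # non-synonymous rates
ds = [0.5, 0.5, 0.4, 0.5, 0.6]   # synonymous rates
sites = positive_selection_sites(dn, ds)  # sites 1 and 3 have omega > 1
```

Runs of consecutive flagged sites that fall inside predicted MHCI-binding regions are what the pipeline uses to rank candidates.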
Meng, Q, Catchpoole, D, Skillicorn, D & Kennedy, PJ 2017, 'DBNorm: Normalizing high-density oligonucleotide microarray data based on distributions', BMC Bioinformatics, vol. 18, no. 1.
Background: Data from patients with rare diseases is often produced using different platforms and probe sets because patients are widely distributed in space and time. Aggregating such data requires a method of normalization that makes patient records comparable. Results: This paper proposes DBNorm, implemented as an R package, an algorithm that normalizes arbitrarily distributed data to a common, comparable form. Specifically, DBNorm merges data distributions by fitting functions to each of them, and using the probability of each element under the fitted distribution to merge it into a global distribution. DBNorm contains state-of-the-art fitting functions including Polynomial, Fourier and Gaussian distributions, and also allows users to define their own fitting functions if required. Conclusions: The performance of DBNorm is compared with z-score, average difference, quantile normalization and ComBat on a set of datasets, including several that are publicly available. The performance of these normalization methods is compared using statistics, visualization, and classification when class labels are known, based on a number of self-generated and public microarray datasets. The experimental results show that DBNorm achieves better normalization results than conventional methods. Finally, the approach has the potential to be applicable outside bioinformatics analysis.
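The core of the approach — fit a distribution to each batch, then push each value through the fitted CDF into a common reference — can be sketched in a few lines. This is a Gaussian-only toy with a hypothetical function name (DBNorm itself also offers Polynomial and Fourier fits and user-defined functions), using only Python's standard library:

```python
import statistics
from statistics import NormalDist

def normalize_to_reference(batch, reference):
    """Map values from `batch` onto `reference`'s scale: fit a normal
    distribution to each set, convert a value to its cumulative
    probability under the batch fit, then take the reference quantile
    at that probability (a probability integral transform)."""
    fit_b = NormalDist(statistics.mean(batch), statistics.stdev(batch))
    fit_r = NormalDist(statistics.mean(reference), statistics.stdev(reference))
    return [fit_r.inv_cdf(fit_b.cdf(x)) for x in batch]

batch = [10.0, 12.0, 14.0]          # e.g. intensities from one platform
reference = [100.0, 120.0, 140.0]   # target distribution from another
aligned = normalize_to_reference(batch, reference)  # ~[100.0, 120.0, 140.0]
```

After the transform, values from both sources live on the same scale and can be aggregated into one cohort.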
Picard, K, Smith, W, Tran, M, Siwabessy, J & Kennedy, PJ 2017, 'Increased resolution bathymetry in the southeast Indian Ocean: MH370 search data', Hydro International, 28 September 2017, pp. 1-5.
The disappearance of Malaysian Airlines flight MH370 on 8 March 2014 led to a deep ocean search effort of unprecedented scale and detail in the remote southeastern Indian Ocean. Between June 2014 and January 2017, two mapping phases took place: (1) a shipborne bathymetric survey, and (2) a higher-resolution search in areas where accurate mapping of the seafloor was required to guide the detailed underwater search aimed at locating the aircraft wreckage. The latter phase used sidescan, multibeam and synthetic aperture sonar mounted on towed or autonomous underwater vehicles (AUVs). This article describes the mapping of the area where the aircraft was expected to be found.
Goodswen, SJ, Kennedy, PJ & Ellis, JT 2017, 'On the application of reverse vaccinology to parasitic diseases: a perspective on feature selection and ranking of vaccine candidates', International Journal for Parasitology, vol. 47, no. 12, pp. 779-790.
Reverse vaccinology has the potential to rapidly advance vaccine development against parasites, but it is unclear which features studied in silico will advance vaccine development. Here we consider Neospora caninum which is a globally distributed protozoan parasite causing significant economic and reproductive loss to cattle industries worldwide. The aim of this study was to use a reverse vaccinology approach to compile a worthy vaccine candidate list for N. caninum, including proteins containing pathogen-associated molecular patterns to act as vaccine carriers. The in silico approach essentially involved collecting a wide range of gene and protein features from public databases or computationally predicting those for every known Neospora protein. This data collection was then analysed using an automated high-throughput process to identify candidates. The final vaccine list compiled was judged to be the optimum within the constraints of available data, current knowledge, and existing bioinformatics programs. We consider and provide some suggestions and experience on how ranking of vaccine candidate lists can be performed. This study is therefore important in that it provides a valuable resource for establishing new directions in vaccine research against neosporosis and other parasitic diseases of economic and medical importance.
Ahadi, A, Brennan, S, Kennedy, P, Hutvagner, G & Tran, N 2016, 'Long non-coding RNAs harboring miRNA seed regions are enriched in prostate cancer exosomes', Scientific Reports, vol. 6, pp. 1-14.
Long non-coding RNAs (lncRNAs) form the largest transcript class in the human transcriptome. These lncRNAs are expressed not only in the cells, but are also present in cell-derived extracellular vesicles such as exosomes. The function of these lncRNAs in cancer biology is not entirely clear, but they appear to be modulators of gene expression. In this study, we characterize the expression of lncRNAs in several prostate cancer exosomes and their parental cell lines. We show that certain lncRNAs are enriched in cancer exosomes, with the overall expression signatures varying across cell lines. These exosomal lncRNAs are themselves enriched for miRNA seeds with a preference for let-7 family members as well as miR-17, miR-18a, miR-20a, miR-93 and miR-106b. The enrichment of miRNA seed regions in exosomal lncRNAs is matched with a concomitant high expression of the same miRNA. In addition, the exosomal lncRNAs also showed an over-representation of RNA binding protein binding motifs. The two most common motifs belonged to ELAVL1 and RBMX. Given the enrichment of miRNA and RBP sites on exosomal lncRNAs, their interplay may suggest a possible function in prostate cancer carcinogenesis.
Anaissi, A, Goyal, M, Catchpoole, DR, Braytee, A & Kennedy, PJ 2016, 'Ensemble Feature Learning of Genomic Data Using Support Vector Machine', PLoS ONE, vol. 11, no. 6, pp. 1-17.
The identification of a subset of genes able to capture the information necessary to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively in the process of gene selection and classification. Testament to that is random forest, which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention, and mostly for classification rather than gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) method for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy which is the rationale of the RFE algorithm. The rationale is that building ensemble SVM models using randomly drawn bootstrap samples from the training set will produce different feature rankings, which are subsequently aggregated into one feature ranking. As a result, the decision to eliminate features is based upon the ranking of multiple SVM models instead of choosing one particular model. Moreover, this approach addresses the problem of imbalanced datasets by constructing a nearly balanced bootstrap sample. Our experiments show that ESVM-RFE for gene selection substantially increased the classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that an average 9% better accuracy is achieved by ESVM-RFE over SVM-RFE, and 5% over the random forest based approach. The genes selected by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD), which reveals significant clusters within the selected data.
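The aggregation step — merging the feature rankings produced by SVM models trained on different bootstrap samples into one consensus ranking — can be sketched independently of the SVM itself. A minimal sketch with hypothetical gene names; mean-rank aggregation is one simple choice, not necessarily the exact scheme used in the paper:

```python
from collections import defaultdict

def aggregate_rankings(rankings):
    """Combine per-model feature rankings (lists of feature names,
    best first) into a consensus ranking by average position."""
    positions = defaultdict(list)
    for ranking in rankings:
        for i, feature in enumerate(ranking):
            positions[feature].append(i)
    return sorted(positions, key=lambda f: sum(positions[f]) / len(positions[f]))

# Rankings from three models trained on different bootstrap samples:
rankings = [["g2", "g1", "g3", "g4"],
            ["g2", "g3", "g1", "g4"],
            ["g1", "g2", "g3", "g4"]]
consensus = aggregate_rankings(rankings)  # ['g2', 'g1', 'g3', 'g4']
```

In an RFE loop, the lowest-ranked features of the consensus are eliminated and the ensemble is retrained on the survivors.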
Nguyen, Q, Khalifa, N, Alzamora, P, Gleeson, A, Catchpoole, D, Kennedy, P & Simoff, S 2016, 'Visual Analytics of Complex Genomics Data to Guide Effective Treatment Decisions', Journal of Imaging, vol. 2, no. 4, pp. 1-17.
In cancer biology, genomics represents a big data problem that needs accurate visual data processing and analytics. The human genome is very complex with thousands of genes that contain the information about the individual patients and the biological mechanisms of their disease. Therefore, when building a framework for personalised treatment, the complexity of the genome must be captured in meaningful and actionable ways. This paper presents a novel visual analytics framework that enables effective analysis of large and complex genomics data. By providing interactive visualisations from the overview of the entire patient cohort to the detail view of individual genes, our work potentially guides effective treatment decisions for childhood cancer patients. The framework consists of multiple components enabling the complete analytics supporting personalised medicines, including similarity space construction, automated analysis, visualisation, gene-to-gene comparison and user-centric interaction and exploration based on feature selection. In addition to the traditional way to visualise data, we utilise the Unity3D platform for developing a smooth and interactive visual presentation of the information. This aims to provide better rendering, image quality, ergonomics and user experience to non-specialists or young users who are familiar with 3D gaming environments and interfaces. We illustrate the effectiveness of our approach through case studies with datasets from childhood cancers, B-cell Acute Lymphoblastic Leukaemia (ALL) and Rhabdomyosarcoma (RMS) patients, on how to guide the effective treatment decision in the cohort.
Anaissi, A, Goyal, M, Catchpoole, DR, Braytee, A & Kennedy, PJ 2015, 'Case-based retrieval framework for gene expression data', Cancer Informatics, vol. 14, pp. 21-31.
BACKGROUND: The process of retrieving similar cases in a case-based reasoning system is considered a big challenge for gene expression data sets. The huge number of gene expression values generated by microarray technology leads to complex data sets and similarity measures for high-dimensional data are problematic. Hence, gene expression similarity measurements require numerous machine-learning and data-mining techniques, such as feature selection and dimensionality reduction, to be incorporated into the retrieval process. METHODS: This article proposes a case-based retrieval framework that uses a k-nearest-neighbor classifier with a weighted-feature-based similarity to retrieve previously treated patients based on their gene expression profiles. RESULTS: The herein-proposed methodology is validated on several data sets: a childhood leukemia data set collected from The Children's Hospital at Westmead, as well as the Colon cancer, the National Cancer Institute (NCI), and the Prostate cancer data sets. Results obtained by the proposed framework in retrieving patients of the data sets who are similar to new patients are as follows: 96% accuracy on the childhood leukemia data set, 95% on the NCI data set, 93% on the Colon cancer data set, and 98% on the Prostate cancer data set. CONCLUSION: The designed case-based retrieval framework is an appropriate choice for retrieving previous patients who are similar to a new patient, on the basis of their gene expression data, for better diagnosis and treatment of childhood leukemia. Moreover, this framework can be applied to other gene expression data sets using some or all of its steps.
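The retrieval step described — a k-nearest-neighbour search with a weighted-feature similarity over gene-expression profiles — can be sketched as follows (toy data and hypothetical names, not the published framework):

```python
def retrieve_similar(query, cases, weights, k=3):
    """Return the k stored cases closest to `query` under a weighted
    Euclidean distance over selected gene-expression features."""
    def distance(profile):
        return sum(w * (q - x) ** 2
                   for w, q, x in zip(weights, query, profile)) ** 0.5
    return sorted(cases, key=lambda c: distance(c["profile"]))[:k]

# Three previously treated patients with two-gene expression profiles:
cases = [{"id": "p1", "profile": [0.10, 0.90]},
         {"id": "p2", "profile": [0.80, 0.20]},
         {"id": "p3", "profile": [0.15, 0.85]}]
nearest = retrieve_similar([0.10, 0.90], cases, weights=[1.0, 1.0], k=2)
# nearest ids: ['p1', 'p3']
```

In practice the weights would come from a feature-selection step, so that informative genes dominate the similarity and noisy ones are downweighted.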
Song, R, Catchpoole, DR, Kennedy, PJ & Li, J 2015, 'Identification of lung cancer miRNA-miRNA co-regulation networks through a progressive data refining approach', Journal of Theoretical Biology, vol. 380, pp. 271-279.
Goodswen, SJ, Barratt, JLN, Kennedy, PJ & Ellis, JT 2015, 'Improving the gene structure annotation of the apicomplexan parasite Neospora caninum fulfils a vital requirement towards an in silico-derived vaccine', International Journal for Parasitology, pp. 305-318.
Neospora caninum is an apicomplexan parasite which can cause abortion in cattle, instigating major economic burden. Vaccination has been proposed as the most cost-effective control measure to alleviate this burden. Consequently the overriding aspiration for N. caninum research is the identification and subsequent evaluation of vaccine candidates in animal models. To save time, cost and effort, it is now feasible to use an in silico approach for vaccine candidate prediction. Precise protein sequences, derived from the correct open reading frame, are paramount and arguably the most important factor determining the success or failure of this approach. The challenge is that publicly available N. caninum sequences are mostly derived from gene predictions. Annotated inaccuracies can lead to erroneously predicted vaccine candidates by bioinformatics programs. This study evaluates the current N. caninum annotation for potential inaccuracies. Comparisons with annotation from a closely related pathogen, Toxoplasma gondii, are also made to distinguish patterns of inconsistency. More importantly, a mRNA sequencing (RNA-Seq) experiment is used to validate the annotation. Potential discrepancies originating from a questionable start codon context and exon boundaries were identified in 1943 protein coding sequences. We conclude, where experimental data were available, that the majority of N. caninum gene sequences were reliably predicted. Nevertheless, almost 28% of genes were identified as questionable. Given the limitations of RNA-Seq, the intention of this study was not to replace the existing annotation but to support or oppose particular aspects of it. Ideally, many studies aimed at improving the annotation are required to build a consensus. We believe this study, in providing a new resource on gene structure and annotation, is a worthy contributor to this endeavour.
Ghous, H, Kennedy, PJ, Ho, N & Catchpoole, DR 2014, 'Comparing Functional Visualisations of Lists of Genes using Singular Value Decomposition', Journal of Research and Practice in Information Technology, vol. 47, no. 1, pp. 47-76.
Progress in understanding core pathways of cancer requires analysis of many genes. New insights are hampered due to the lack of tools to make sense of large lists of genes identified using high throughput technology. Data mining, particularly visualisation that finds relationships between genes and the Gene Ontology (GO), can assist in functional understanding. This paper addresses the question using GO annotations for functional understanding of genes. We augment genes with GO terms using two similarity measures: a Hop-based measure and an Information Content based measure, and visualise with Singular Value Decomposition (SVD). The results demonstrate that SVD visualisation of GO-augmented genes matches the biological understanding expected in simulated and real-life data. Differences are observed in visualisation of GO terms, where the information content method produces more tightly-packed clusters than the hop-based method.
Tafavogh, S, Catchpoole, DR & Kennedy, PJ 2014, 'Cellular quantitative analysis of neuroblastoma tumor and splitting overlapping cells', BMC Bioinformatics, vol. 15.
Goodswen, SJ, Kennedy, PJ & Ellis, JT 2014, 'Discovering a vaccine against neosporosis using computers: is it feasible?', Trends in Parasitology, vol. 30, no. 8, pp. 401-411.
Goodswen, SJ, Kennedy, PJ & Ellis, JT 2014, 'Enhancing In Silico Protein-Based Vaccine Discovery for Eukaryotic Pathogens Using Predicted Peptide-MHC Binding and Peptide Conservation Scores', PLOS ONE, vol. 9, no. 12.
Goodswen, SJ, Kennedy, PJ & Ellis, JT 2014, 'Vacceed: a high-throughput in silico vaccine candidate discovery pipeline for eukaryotic pathogens based on reverse vaccinology', Bioinformatics, vol. 30, no. 16, pp. 2381-2383.
Anaissi, A, Kennedy, PJ, Goyal, M & Catchpoole, DR 2013, 'A balanced iterative random forest for gene selection from microarray data', BMC Bioinformatics, vol. 14, pp. 1-10.
The wealth of gene expression values being generated by high throughput microarray technologies leads to complex high dimensional datasets. Moreover, many cohorts have the problem of imbalanced classes where the number of patients belonging to each class is not the same. With this kind of dataset, biologists need to identify a small number of informative genes that can be used as biomarkers for a disease.
This paper introduces a Balanced Iterative Random Forest (BIRF) algorithm to select the most relevant genes for a disease from imbalanced high-throughput gene expression microarray data. Balanced iterative random forest is applied on four cancer microarray datasets: a childhood leukaemia dataset, which represents the main target of this paper, collected from The Children's Hospital at Westmead, NCI 60, a Colon dataset and a Lung cancer dataset. The results obtained by BIRF are compared to those of Support Vector Machine-Recursive Feature Elimination (SVM-RFE), Multi-class SVM-RFE (MSVM-RFE), Random Forest (RF) and Naive Bayes (NB) classifiers. The results of the BIRF approach outperform these state-of-the-art methods, especially in the case of imbalanced datasets. Experiments on the childhood leukaemia dataset show that a 7% ∼ 12% better accuracy is achieved by BIRF over MSVM-RFE with the ability to predict patients in the minor class. The informative biomarkers selected by the BIRF algorithm were validated by repeating training experiments three times to see whether they are globally informative, or just selected by chance. The results show that 64% of the top genes consistently appear in the three lists, and the top 20 genes remain near the top in the other three lists.
The designed BIRF algorithm is an appropriate choice to select genes from imbalanced high-throughput gene expression microarray data. BIRF outperforms the state-of-the-art methods, especially in its ability to handle class-imbalanced data. Moreover, the analysis...
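The balanced-bootstrap idea at the heart of BIRF can be illustrated with a short numpy sketch (hypothetical code, not the authors' implementation; `importance_fn` stands in for the forest's variable-importance measure):

```python
import numpy as np

def balanced_bootstrap(y, rng):
    # Draw an equal number of samples (with replacement) from each class,
    # so that every iteration of the forest sees a class-balanced set.
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=True)
        for c in classes
    ])
    rng.shuffle(idx)
    return idx

def iterative_select(X, y, importance_fn, keep_frac=0.5, min_features=5, seed=0):
    # Repeatedly drop the least important features, re-scoring the
    # survivors on a fresh balanced bootstrap each round.
    rng = np.random.default_rng(seed)
    feats = np.arange(X.shape[1])
    while feats.size > min_features:
        idx = balanced_bootstrap(y, rng)
        scores = importance_fn(X[np.ix_(idx, feats)], y[idx])
        order = np.argsort(scores)[::-1]
        keep = max(min_features, int(feats.size * keep_frac))
        feats = feats[order[:keep]]
    return feats
```

Balancing before scoring is what lets the procedure retain genes that discriminate the minority class, which a plain bootstrap would under-sample.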
Tafavogh, S, Felix Navarro, KM, Catchpoole, DR & Kennedy, PJ 2013, 'Non-parametric and integrated framework for segmenting and counting neuroblastic cells within neuroblastoma tumor images', Medical & Biological Engineering & Computing, vol. 51, no. 6, pp. 645-665.
Neuroblastoma is a malignant tumor and a cancer in childhood that derives from the neural crest. The number of neuroblastic cells within the tumor provides significant prognostic information for pathologists. An enormous number of neuroblastic cells makes the process of counting tedious and error-prone. We propose a user interaction-independent framework that segments cellular regions, splits the overlapping cells and counts the total number of single neuroblastic cells. Our novel segmentation algorithm regards an image as a feature space constructed by joint spatial-intensity features of color pixels. It clusters the pixels within the feature space using mean-shift and then partitions the image into multiple tiles. We propose a novel color analysis approach to select the tiles with similar intensity to the cellular regions. The selected tiles contain a mixture of single and overlapping cells. We therefore also propose a cell counting method to analyse the morphology of the cells and discriminate between overlapping and single cells. Ultimately, we apply watershed to split overlapping cells. The results have been evaluated by a pathologist. Our segmentation algorithm was compared against adaptive thresholding. Our cell counting algorithm was compared with two state-of-the-art algorithms. The overall cell counting accuracy of the system is 87.65%.
Goodswen, SJ, Kennedy, PJ & Ellis, JT 2013, 'A guide to in silico vaccine discovery for eukaryotic pathogens', Briefings in Bioinformatics, vol. 14, no. 6, pp. 753-774.
In this article, a framework for an in silico pipeline is presented as a guide to high-throughput vaccine candidate discovery for eukaryotic pathogens, such as helminths and protozoa. Eukaryotic pathogens are mostly parasitic and cause some of the most damaging and difficult to treat diseases in humans and livestock. Consequently, these parasitic pathogens have a significant impact on economy and human health. The pipeline is based on the principle of reverse vaccinology and is constructed from freely available bioinformatics programs. There are several successful applications of reverse vaccinology to the discovery of subunit vaccines against prokaryotic pathogens but not yet against eukaryotic pathogens. The overriding aim of the pipeline, which focuses on eukaryotic pathogens, is to generate through computational processes of elimination and evidence gathering a ranked list of proteins based on a scoring system. These proteins are either surface components of the target pathogen or are secreted by the pathogen and are of a type known to be antigenic. No perfect predictive method is yet available; therefore, the highest-scoring proteins from the list require laboratory validation.
Goodswen, SJ, Kennedy, PJ & Ellis, JT 2013, 'A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms', BMC Bioinformatics, vol. 14, no. 1, pp. 315-327.
An in silico vaccine discovery pipeline for eukaryotic pathogens typically consists of several computational tools to predict protein characteristics. The aim of the in silico approach to discovering subunit vaccines is to use predicted characteristics to identify proteins which are worthy of laboratory investigation. A major challenge is that these predictions inherently contain hidden inaccuracies and contradictions. This study focuses on how to reduce the number of false candidates using machine learning algorithms rather than relying on expensive laboratory validation. Proteins from Toxoplasma gondii, Plasmodium sp., and Caenorhabditis elegans were used as training and test datasets.
Goodswen, SJ, Kennedy, PJ & Ellis, JT 2013, 'A review of the infection, genetics, and evolution of Neospora caninum: from the past to the present', Infection, Genetics and Evolution, vol. 13, no. 1, pp. 133-150.
This paper is a review of current knowledge on Neospora caninum in the context of other apicomplexan parasites and with an emphasis on: life cycle, disease, epidemiology, immunity, control and treatment, evolution, genomes, and biological databases and web resources. N. caninum is an obligate, intracellular, coccidian, protozoan parasite of the phylum Apicomplexa. Infection can cause the clinical disease neosporosis, which most notably is associated with abortion in cattle. These abortions are a major root cause of economic loss to both the dairy and beef industries worldwide. N. caninum has been detected in every country in which a study has been specifically conducted to detect this parasite in cattle. The major mode of transmission in cattle is transplacental (or vertical) transmission and several elements of the N. caninum life cycle are yet to be studied in detail. The outcome of an infection is inextricably linked to the precise timing of the infection coupled with the status of the immune system of the dam and foetus. There is no community consensus as to whether it is the dam's pro-inflammatory cytotoxic response to tachyzoites that kills the foetus or the tachyzoites themselves. From economic analysis, the most cost-effective approach to control neosporosis is a vaccine. The perfect vaccine would protect against both infection and the clinical disease, and this implies a vaccine is needed that can induce a non-foetopathic cell mediated immunity response. Researchers are beginning to capitalise on the vast potential of -omics data (e.g. genomes, transcriptomes, and proteomes) to further our understanding of pathogens but especially to identify vaccine and drug targets. The recent publication of a genome for N. caninum offers vast opportunities in these areas.
Unprocessed rock is a massive resource of very cheap building material with very low embodied energy. However, it is highly underutilised due to the difficulty of dealing with irregular shaped blocks. We have developed a novel software application using the artificial intelligence methods of search and optimisation to simulate building three-dimensional structures in a virtual world. The aim of our software is to help builders solve the three-dimensional jigsaw puzzle of building with rock rubble with an emphasis on its potential use for building sustainable housing and infrastructure. This paper describes our approach and the design of our software including an overview of the rock digitising, optimisation software and building methods. We present simulation results of building and testing several small drystone structures using the prototype software.
Ellis, JT, Goodswen, SJ, Kennedy, PJ & Bush, SA 2012, 'The Core Mouse Response to Infection by Neospora Caninum Defined by Gene Set Enrichment Analyses', Bioinformatics and Biology Insights, vol. 6, pp. 187-202.
In this study, the BALB/c and Qs mouse responses to infection by the parasite Neospora caninum were investigated in order to identify host response mechanisms. Investigation was done using gene set (enrichment) analyses of microarray data. GSEA, MANOVA, Romer, subGSE and SAM-GS were used to study the contrasts Neospora strain type, mouse type (BALB/c and Qs) and time post infection (6 hours post infection and 10 days post infection). The analyses show that the major signal in the core mouse response to infection is from time post infection and can be defined by the gene ontology terms Protein Kinase Activity, Cell Proliferation and Transcription Initiation. Several terms linked to signaling, morphogenesis, response and fat metabolism were also identified. At 10 days post infection, genes associated with fatty acid metabolism were identified as up regulated in expression. The value of gene set (enrichment) analyses in the analysis of microarray data is discussed.
Goodswen, SJ, Kennedy, PJ & Ellis, JT 2012, 'Evaluating High-Throughput Ab Initio Gene Finders to Discover Proteins Encoded in Eukaryotic Pathogen Genomes Missed by Laboratory Techniques', PLOS ONE, vol. 7, no. 11.
Analysis and visualization of microarray data is very helpful for biologists and clinicians in the diagnosis and treatment of patients. It allows clinicians to better understand the structure of microarray data and facilitates understanding of gene expression in cells. However, a microarray dataset is complex, with thousands of features and a very small number of observations. This very high dimensional data often contains noise, non-useful information and only a small number of features relevant to disease or genotype. This paper proposes a non-linear dimensionality reduction algorithm, Local Principal Component (LPC), which aims to map high dimensional data to a lower dimensional space. The reduced data represents the most important variables underlying the original data. Experimental results and comparisons are presented to show the quality of the proposed algorithm. Moreover, experiments show how the algorithm reduces high dimensional data whilst preserving the neighbourhoods of points in the low dimensional space as in the high dimensional space.
Catchpoole, DR, Kennedy, P, Skillicorn, DB & Simoff, S 2010, 'The Curse of Dimensionality: A Blessing to Personalized Medicine', Journal of Clinical Oncology, vol. 28, no. 34, pp. E723-E724.
Milton, J & Kennedy, PJ 2010, 'Static and Dynamic Selection Thresholds Governing the Accumulation of Information in Genetic Algorithms Using Ranked Populations', Evolutionary Computation, vol. 18, no. 2, pp. 229-254.
Mutation applied indiscriminately across a population has, on average, a detrimental effect on the accumulation of solution alleles within the population and is usually beneficial only when targeted at individuals with few solution alleles. Many common selection techniques can delete individuals with more solution alleles than are easily recovered by mutation. The paper identifies static and dynamic selection thresholds governing accumulation of information in a genetic algorithm (GA). When individuals are ranked by fitness, there exists a dynamic threshold defined by the solution density of surviving individuals and a lower static threshold defined by the solution density of the information source used for mutation. Replacing individuals ranked below the static threshold with randomly generated individuals avoids the need for mutation while maintaining diversity in the population with a consequent improvement in population fitness. By replacing individuals ranked between the thresholds with randomly selected individuals from above the dynamic threshold, population fitness improves dramatically. We model the dynamic behavior of GAs using these thresholds and demonstrate their effectiveness by simulation and benchmark problems.
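The two-threshold replacement scheme described above might be sketched as follows (a hypothetical illustration, not the paper's implementation; `dynamic_k` and `static_k` are the ranks of the dynamic and static thresholds in a best-first ordering):

```python
import random

def apply_thresholds(ranked_pop, dynamic_k, static_k, new_individual, rng=random):
    # ranked_pop is sorted best-first; dynamic_k <= static_k.
    # Individuals above the dynamic threshold survive unchanged.
    survivors = ranked_pop[:dynamic_k]
    # Between the thresholds: replace with randomly selected copies of
    # survivors from above the dynamic threshold.
    middle = [rng.choice(survivors) for _ in range(static_k - dynamic_k)]
    # Below the static threshold: replace with freshly generated random
    # individuals, maintaining diversity without mutation.
    fresh = [new_individual() for _ in range(len(ranked_pop) - static_k)]
    return survivors + middle + fresh
```

In this reading, mutation is replaced entirely by targeted replacement: solution alleles held by high-ranked individuals are never put at risk.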
Kennedy, PJ & Osborn, T 2001, 'A Model of Gene Expression and Regulation in an Artificial Cellular Organism', Complex Systems, vol. 13, no. 1.
Gene expression and regulation may be viewed as a parallel parsing algorithm---translation from a genomic language to a phenotype. We describe a model of gene expression and regulation based on the operon model of Jacob and Monod. Operons are groups of genes regulated in the same way. An artificial cellular metabolism expresses operons encoded on a genome in a parallel genomic language. This is accomplished using an abstract entity called a spider. A genetic algorithm is used to evolve the simulated cells to adapt to a simple environment. Genomes are subjected to recombination, mutation, and inversion operators. Observations from this experiment suggest four areas to explore: dynamic environments for the evolution of regulation, advantages of time lags inherent in the expression algorithm, sensitivity of our genomic language, and noncoding regions on the genome. Issues relating to the application of the expression model to evolutionary computation are discussed.
Braytee, A, Gill, AQ, Kennedy, PJ & Hussain, FK 2015, 'A Review and Comparison of Service E-Contract Architecture Metamodels' in Neural Information Processing, Springer International Publishing, pp. 583-595.
Ubaudi, FA, Kennedy, PJ, Catchpoole, DR, Guo, D & Simoff, SJ 2009, 'Microarray data mining: selecting trustworthy genes with gene feature ranking' in Data Mining for Business Applications, Springer, New York, USA, pp. 159-168.
Gene expression datasets used in biomedical data mining frequently have two characteristics: they have many thousand attributes but only relatively few sample points and the measurements are noisy. In other words, individual expression measurements may be untrustworthy. Gene Feature Ranking (GFR) is a feature selection methodology that addresses these domain specific characteristics by selecting features (i.e. genes) based on two criteria: (i) how well the gene can discriminate between classes of patient and (ii) the trustworthiness of the microarray data associated with the gene. An example from the pediatric cancer domain demonstrates the use of GFR and compares its performance with a feature selection method that does not explicitly address the trustworthiness of the underlying data.
Kennedy, PJ, Simoff, SJ, Catchpoole, DR, Skillicorn, D, Ubaudi, FA & Aloqaily, A 2008, 'Integrative visual data mining of biomedical data: Investigating cases in Chronic Fatigue Syndrome and Acute Lymphoblastic Leukaemia' in Simoff, SJ, Bohlen, MH & Mazeika, A (eds), Visual Data Mining: Theory, Techniques and Tools for Visual Analytics, Springer, Berlin Heidelberg, pp. 367-388.
This chapter presents an integrative visual data mining approach towards biomedical data. This approach and supporting methodology are presented at a high level. They combine in a consistent manner a set of visualisation and data mining techniques that operate over an integrated data set of several diverse components, including medical (clinical) data, patient outcome and interview data, corresponding gene expression and SNP data, domain ontologies and health management data. The practical application of the methodology and the specific data mining techniques engaged are demonstrated on two case studies focused on the biological mechanisms of two different types of diseases: Chronic Fatigue Syndrome and Acute Lymphoblastic Leukaemia, respectively. What the cases have in common is the structure of the data sets.
Sidhu, AS, Kennedy, PJ, Simoff, S, Dillon, TS & Chang, E 2008, 'Knowledge discovery in biomedical data facilitated by domain ontologies' in Medical Informatics: Concepts, Methodologies, Tools, and Applications, pp. 2096-2108.
In some real-world areas, it is important to enrich the data with external background knowledge so as to provide context and to facilitate pattern recognition. These areas may be described as data rich but knowledge poor. There are two challenges to incorporate this biological knowledge into the data mining cycle: (1) generating the ontologies; and (2) adapting the data mining algorithms to make use of the ontologies. This chapter presents the state-of-the-art in bringing the background ontology knowledge into the pattern recognition task for biomedical data.
Sidhu, AS, Kennedy, PJ, Simoff, SJ, Dillon, TS & Chang, E 2007, 'Knowledge Discovery in Biomedical Data Facilitated by Domain Ontologies' in Zhu, X & Davidson, I (eds), Knowledge Discovery and Data Mining: Challenges and Realities, IGI Global, Hershey, USA, pp. 189-201.
Naji, M, Braytee, A, Al-Ani, A, Anaissi, A, Goyal, M & Kennedy, PJ 2020, 'Design of airport security screening using queueing theory augmented with particle swarm optimisation', Service Oriented Computing and Applications, pp. 119-133.
Designing an efficient and reliable airport security screening system is a critical and challenging task. It is an essential element of airline and passenger safety which aims to provide the expected level of confidence and to ensure the safety of passengers and the aviation industry. In recent years, security at airports has gone through noticeable improvements with the utilisation of advanced technology and highly trained security officers. However, for many airports, it is important to find the best compromise between the capacity of the security area, the number of passengers and the number of screening machines and officers to maintain a high level of security and to ensure that the cost and waiting times for passengers and airlines are at acceptable levels. This paper proposes a novel method based on queueing theory augmented with particle swarm optimisation (QT-PSO) to predict passenger waiting times in a security screening context. The model consists of multiple servers operating in parallel and takes into consideration complete scenarios such as normal, slow and express lanes. Such an approach has the potential to be a reliable model that is able to assimilate variations in the number of passengers, security officers and security machines on the service time. To evaluate our proposed method, we collected real-world security screening data from an Australian airport from December to March for the two consecutive years of 2016 and 2017. The results show that our proposed QT-PSO method is superior to the state of the art in predicting the average waiting time of passengers.
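The queueing-theory half of QT-PSO builds on standard multi-server waiting-time results. As a hedged illustration only (not the authors' model, which also folds in PSO and multiple lane types), the mean queueing delay in a classical M/M/c system follows from the Erlang C formula:

```python
from math import factorial

def erlang_c(servers, load):
    # Probability that an arriving passenger must queue in an M/M/c system,
    # where load = arrival_rate / service_rate (in erlangs), load < servers.
    top = load**servers / factorial(servers) * servers / (servers - load)
    bottom = sum(load**k / factorial(k) for k in range(servers)) + top
    return top / bottom

def mean_wait(arrival_rate, service_rate, servers):
    # Expected time in queue (Wq) for an M/M/c system.
    load = arrival_rate / service_rate
    return erlang_c(servers, load) / (servers * service_rate - arrival_rate)
```

With `servers = 1` this reduces to the familiar M/M/1 result Wq = rho / (mu - lambda), a convenient sanity check.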
Brunker, A, Catchpoole, D, Kennedy, P, Simoff, S & Nguyen, QV 2019, 'Two-dimensional immersive cohort analysis supporting personalised medical treatment', Proceedings - 2019 23rd International Conference in Information Visualization - Part II, IV-2 2019, International Conference in Information Visualization, IEEE, Adelaide, Australia, pp. 34-41.
Genomic data are large and complex, making them challenging to visualize effectively on ordinary screens with limited display space. Large, high resolution displays can show more information at once for better comprehension of the visualization. This paper presents a two-dimensional interactive visualization system and supporting algorithm for multi-dimensional large genomic data analysis that can be used on both ordinary displays and in immersive environments. We provide both a view of the entire patient cohort in the similarity space and the genomic details of the currently selected patients for comparison. Through the similarity space and the selected genes of interest, we are able to perceive the genetic similarity throughout the cohort. From the linked heat map visualisation of the selected genes, we apply hierarchical clustering on both the horizontal and vertical axes to group together the genetically similar patients. We demonstrate the effectiveness of the visualization with two case studies on pediatric cancer patients suffering from Acute Lymphoblastic Leukemia (ALL) and from Rhabdomyosarcoma (RMS).
Hesamian, MH, Jia, W, He, X & Kennedy, PJ 2019, 'Atrous Convolution for Binary Semantic Segmentation of Lung Nodule', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Brighton, UK, pp. 1015-1019.
Accurately estimating the size of tumours and reproducing their boundaries from lung CT images provides crucial information for early diagnosis, staging and evaluating patients' response to cancer therapy. This paper presents an advanced solution to segment lung nodules from CT images by employing a deep residual network structure with Atrous convolution. The Atrous convolution increases the field of view of the filters and helps to improve classification accuracy. Moreover, in order to address the significant class imbalance between nodule pixels and background non-nodule pixels, a weighted loss function is proposed. We evaluate our proposed solution on the widely adopted benchmark dataset LIDC. A promising result of an average DSC of 81.24% is achieved, outperforming the state of the art. This demonstrates the effectiveness and importance of applying the Atrous convolution and weighted loss for such problems.
Naji, M, Al-Ani, A, Braytee, A, Anaissi, A & Kennedy, P 2019, 'Queue Formation Augmented with Particle Swarm Optimisation to Improve Waiting Time in Airport Security Screening', Advances in Intelligent Systems and Computing, Workshops of the 33rd International Conference on Advanced Information Networking and Applications, Springer, Japan, pp. 923-935.
Airport security screening processes are essential to ensure the safety of both passengers and the aviation industry. Security at airports has improved noticeably in recent years through the utilisation of state-of-the-art technologies and highly trained security officers. However, maintaining a high level of security can be costly to operate and implement. It may also lead to delays for passengers and airlines. This paper proposes a novel queue formation method based on a queueing theory model augmented with a particle swarm optimisation method, known as QQT-PSO, to improve the average waiting time in airport security areas. Extensive experiments were conducted using real-world datasets collected from Sydney airport. Our method significantly reduces the average waiting time and operating cost, by 11.89% compared to the existing one-queue formation.
Awan, Z, Kahlke, T, Ralph, PJ & Kennedy, PJ 2019, 'Chemical named entity recognition with deep contextualized neural embeddings', IC3K 2019 - Proceedings of the 11th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Austria, pp. 135-144.
Chemical named entity recognition (ChemNER) is a preliminary step in chemical information extraction pipelines. ChemNER has been approached using rule-based, dictionary-based, and feature-engineering based machine learning methods, and more recently also deep learning based methods. Traditional word embeddings, like word2vec and GloVe, are inherently problematic because they ignore the context in which an entity appears. Contextualized embeddings called embedded language models (ELMo) have recently been introduced to represent the contextual information of a word in its embedding space. In this work, we quantify the impact of contextualized embeddings for ChemNER by using Bi-LSTM-CRF (bidirectional long short term memory network - conditional random field) networks. We benchmarked our approach using four well-known corpora for chemical named entity recognition. Our results show that incorporation of ELMo results in statistically significant improvements in F1 score on all of the tested datasets.
Hayati, H, Walker, P, Brown, T, Kennedy, P & Eager, D 2018, 'A simple spring-loaded inverted pendulum (SLIP) model of a bio-inspired quadrupedal robot over compliant terrains', Proceedings of the ASME 2018 International Mechanical Engineering Congress and Exposition IMECE2018, International Mechanical Engineering Congress and Exposition, ASME, USA.
To study the impact of compliant terrains on the biomechanics of rapid legged movements, a well-known spring loaded inverted pendulum (SLIP) model is deployed. The model is a three-degrees-of-freedom (3 DOF) system, inspired by galloping greyhounds competing in a racing condition. A single support phase of hind-leg stance in a galloping gait is taken into consideration due to its primary function in powering the greyhounds' locomotion and its higher rate of musculoskeletal injuries. The nonlinear second-order differential equations of motion were obtained using the Lagrangian method and solved with MATLAB R2017b (ode45 solver), which is based on the Runge-Kutta method. To obtain the viscoelastic behaviour of compliant terrains, a Clegg hammer test was developed and performed five times on each sample. The effective spring and damping coefficients of each sample were then determined from the hysteresis curves. The results showed that galloping on synthetic rubber requires more muscle force compared with wet sand. However, according to the Clegg hammer test, wet sand had a higher impact force than synthetic rubber, which can be a risk factor for bone fracture, particularly hock fracture, in greyhounds. The results reported in this paper are not only useful for identifying optimum terrain properties and injury thresholds of an athletic track, but can also be used to design control methods and shock impedances for legged robots performing on compliant terrains.
Braytee, A, Anaissi, A & Kennedy, PJ 2018, 'Sparse feature learning using ensemble model for highly-correlated high-dimensional data', Neural Information Processing (LNCS), International Conference on Neural Information Processing, Springer, Siem Reap, Cambodia, pp. 423-434.
High-dimensional highly correlated data exist in several domains such as genomics. Many feature selection techniques consider correlated features redundant and therefore remove them. Several studies investigate the interpretation of correlated features in domains such as genomics, but investigating the classification capabilities of correlated feature groups is a point of interest in several domains. In this paper, a novel method is proposed that integrates ensemble feature ranking and co-expression networks to identify the optimal features for classification. The main advantage of the proposed method lies in the fact that it does not consider the correlated features as redundant; instead, it shows the importance of the selected correlated features in improving the performance of classification. A series of experiments on five high dimensional highly correlated datasets with different levels of imbalance ratios show that the proposed method outperformed the state-of-the-art methods.
Gheisari, S, Catchpoole, DR, Charlton, A & Kennedy, PJ 2018, 'Patched completed local binary pattern is an effective method for neuroblastoma histological image classification', Communications in Computer and Information Science, Australasian Conference on Data Mining, Bathurst, NSW, Australia, pp. 57-71.
Neuroblastoma is the most common extracranial solid tumour in children. The histology of neuroblastoma has high intra-class variation, which misleads existing computer-aided histological image classification methods that use global features. To tackle this problem, we propose a new Patched Completed Local Binary Pattern (PCLBP) method combining Sign Binary Pattern (SBP) and Magnitude Binary Pattern (MBP) within local patches to build feature vectors which are classified by k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM) classifiers. The advantage of our method is extracting local features which are more robust to intra-class variation compared to global ones. We gathered a database of 1043 histologic images of neuroblastic tumours classified into five subtypes. Our experiments show the proposed method improves the weighted average F-measure by 1.89% and 0.81% with k-NN and SVM classifiers, respectively.
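The sign and magnitude patterns that PCLBP combines can be illustrated on a single 3x3 patch (a minimal sketch of standard CLBP components, not the paper's implementation, which builds histograms of these codes over local patches):

```python
import numpy as np

# Offsets of the 8 neighbours of the centre of a 3x3 patch, in ring order.
_RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def sign_binary_pattern(patch):
    # SBP: set a bit when the neighbour is at least as bright as the centre.
    c = patch[1, 1]
    bits = [1 if patch[r, col] >= c else 0 for r, col in _RING]
    return sum(b << i for i, b in enumerate(bits))

def magnitude_binary_pattern(patch):
    # MBP: set a bit when |neighbour - centre| is at least the mean magnitude.
    c = float(patch[1, 1])
    mags = np.array([abs(float(patch[r, col]) - c) for r, col in _RING])
    bits = (mags >= mags.mean()).astype(int)
    return int(sum(b << i for i, b in enumerate(bits)))
```

The sign code captures local texture shape while the magnitude code captures local contrast; concatenating their per-patch histograms gives a feature vector robust to intra-class variation.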
Mahdavi, F, Hossain, MI, Hayati, H, Eager, D & Kennedy, P 2018, 'Track Shape, Resulting Dynamics and Injury Rates of Greyhounds', Volume 13: Design, Reliability, Safety, and Risk, International Mechanical Engineering Congress and Exposition, ASME, Pittsburgh, Pennsylvania, USA.
Roberts, AGK, Catchpoole, DR & Kennedy, PJ 2018, 'Variance-based Feature Selection for Classification of Cancer Subtypes Using Gene Expression Data', Proceedings of the International Joint Conference on Neural Networks, International Joint Conference on Neural Networks, IEEE, Rio de Janeiro, Brazil.
Classification in cancer has traditionally relied on feature selection by differential expression as a first step, where genes are selected according to the strength of evidence for a consistent difference in expression level between classes. However, recent work has shown that many genes also differ in the variance of their gene expression between disease states, and in particular between cancers of different types, prognosis, or stages of development. Features selected based on increased variance in cancer or differences in variance between tumours of differing prognosis have been used to successfully predict tumour progression or prognosis within the same cancer type, and to classify cancer subtypes in cases where there is an overall increase in variance in one class over the other. Here, we apply feature selection by differential variance to the more general problem of classification of cancer subtypes. We show that classifiers using features selected by differential variance are able to distinguish between clinically relevant cancer subtypes, that these classifiers perform as well as classifiers based on features selected by differential expression, and that combining the two approaches often gives better classification results than either feature selection method alone.
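Feature selection by differential variance can be sketched in a few lines of numpy (a hypothetical illustration; the paper's actual test statistic may differ):

```python
import numpy as np

def differential_variance_scores(X, y):
    # Score each gene by the absolute difference in log-variance between
    # two classes: large scores mean the spread of expression, not just
    # its mean, separates the subtypes.
    v0 = X[y == 0].var(axis=0, ddof=1)
    v1 = X[y == 1].var(axis=0, ddof=1)
    return np.abs(np.log(v0) - np.log(v1))

def select_by_differential_variance(X, y, k):
    # Keep the k genes with the largest variance difference.
    return np.argsort(differential_variance_scores(X, y))[::-1][:k]
```

Because the score ignores class means entirely, it picks up genes that a differential-expression filter would miss, which is why combining the two criteria can outperform either alone.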
Braytee, A, Liu, W & Kennedy, PJ 2017, 'Supervised context-aware non-negative matrix factorization to handle high-dimensional high-correlated imbalanced biomedical data', Proceedings of the International Joint Conference on Neural Networks, 2017 International Joint Conference on Neural Networks, Anchorage, AK, USA, pp. 4512-4519.
Traditional feature selection techniques are used to identify a subset of the most useful features, and consider the rest unimportant, redundant or noisy. In the presence of highly correlated features, many variable selection methods treat correlated features as redundant and remove them. In this paper, a novel supervised feature selection algorithm, SCANMF, is proposed by jointly integrating correlation analysis and structural analysis of balanced supervised non-negative matrix factorization (NMF). Furthermore, an ℓ2,1-norm minimization constraint is incorporated into the objective function to guarantee sparsity in the feature matrix rows and reduce noisy features. Our algorithm exploits the discriminative information, feature combinations, and the original features in the context of a supervised NMF method, which can be beneficial for both classification and interpretation. An efficient iterative algorithm is designed to solve the constrained optimization problem with guaranteed convergence. Finally, a series of extensive experiments are conducted on 8 complex datasets. Promising results using multiple classifiers demonstrate the effectiveness and efficiency of our algorithm over state-of-the-art methods.
Braytee, A, Liu, W, Catchpoole, DR & Kennedy, PJ 2017, 'Multi-label feature selection using correlation information', International Conference on Information and Knowledge Management, Proceedings, ACM on Conference on Information and Knowledge Management, ACM, Singapore, Singapore, pp. 1649-1656.
© 2017 ACM. High-dimensional multi-labeled data contain instances, where each instance is associated with a set of class labels and has a large number of noisy and irrelevant features. Feature selection has been shown to have great benefits in improving classification performance in machine learning. In multi-label learning, to select the discriminative features among multiple labels, several challenges should be considered: interdependent labels, different instances may share different label correlations, correlated features, and missing and flawed labels. This work is part of a project at The Children's Hospital at Westmead (TB-CHW), Australia to explore the genomics of childhood leukaemia. In this paper, we propose CMFS (a Correlated- and Multi-label Feature Selection method), based on non-negative matrix factorization (NMF), for simultaneously performing feature selection and addressing the aforementioned challenges. Significantly, a major advantage of our research is to exploit the correlation information contained in features, labels and instances to select the relevant features among multiple labels. Furthermore, ℓ2,1-norm regularization is incorporated in the objective function to undertake feature selection by imposing sparsity on the feature matrix rows. We employ CMFS to decompose the data and multi-label matrices into a low-dimensional space. To solve the objective function, an efficient iterative optimization algorithm is proposed with guaranteed convergence. Finally, extensive experiments are conducted on high-dimensional multi-labeled datasets. The experimental results demonstrate that our method significantly outperforms state-of-the-art multi-label feature selection methods.
Meng, Q, Catchpoole, D, Skillicorn, D & Kennedy, PJ 2017, 'Relational autoencoder for feature extraction', Proceedings of the International Joint Conference on Neural Networks, International Joint Conference on Neural Networks, IEEE, Anchorage, AK, USA, pp. 364-371.
© 2017 IEEE. Feature extraction becomes increasingly important as data grows high dimensional. The autoencoder, as a neural-network-based feature extraction method, has achieved great success in generating abstract features of high-dimensional data. However, it fails to consider the relationships between data samples, which may affect the results of using the original and new features. In this paper, we propose a Relation Autoencoder model that considers both data features and their relationships. We also extend it to work with other major autoencoder models including the Sparse Autoencoder, Denoising Autoencoder and Variational Autoencoder. The proposed relational autoencoder models are evaluated on a set of benchmark datasets, and the experimental results show that considering data relationships can generate more robust features which achieve lower reconstruction loss and, in turn, lower error rates in subsequent classification compared to the other variants of autoencoders.
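The core idea of augmenting reconstruction loss with a relationship-preserving term can be sketched as below. The Gram-matrix relationship measure and the trade-off weight `lam` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def relational_loss(X, X_hat, lam=0.5):
    """Standard reconstruction error plus a penalty on distortion
    of pairwise sample relationships, here measured through the
    Gram matrix X @ X.T. lam trades off the two terms."""
    rec = np.mean((X - X_hat) ** 2)
    rel = np.mean((X @ X.T - X_hat @ X_hat.T) ** 2)
    return float(rec + lam * rel)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
perfect = relational_loss(X, X)          # relationships intact
distorted = relational_loss(X, 0.5 * X)  # shrunken reconstruction
```

A reconstruction that preserves each sample individually but scrambles inter-sample similarities would score well on the first term alone; the second term penalises exactly that failure mode.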
Meng, Q, Wu, J, Ellis, J & Kennedy, PJ 2017, 'Dynamic island model based on spectral clustering in genetic algorithm', Proceedings of the International Joint Conference on Neural Networks, International Joint Conference on Neural Networks, IEEE, Anchorage, AK, USA, pp. 1724-1731.
© 2017 IEEE. Maintaining relatively high diversity is important to avoid premature convergence in population-based optimization methods. The island model is widely considered a major approach to achieve this because of its flexibility and high efficiency. The model maintains a group of sub-populations on different islands and allows sub-populations to interact with each other via predefined migration policies. However, the current island model has some drawbacks. One is that after a certain number of generations, different islands may retain quite similar, converged sub-populations, thereby losing diversity and decreasing efficiency. Another is that determining the number of islands to maintain is very challenging, while initializing many sub-populations increases the randomness of the island model. To address these issues, we propose a dynamic island model (DIM-SP) which can force each island to maintain a different sub-population, control the number of islands dynamically, and start with a single sub-population. The proposed island model outperforms three other state-of-the-art island models on three baseline optimization problems: job shop scheduling, the travelling salesman problem, and the quadratic multiple knapsack problem.
Braytee, A, Catchpoole, DR, Kennedy, PJ & Liu, W 2016, 'Balanced Supervised Non-Negative Matrix Factorization for Childhood Leukaemia Patients', CIKM '16 Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ACM International Conference on Information and Knowledge Management, ACM, Indianapolis, Indiana, USA.
Supervised feature extraction methods have received considerable attention in the data mining community due to their capability to improve on the classification performance of unsupervised dimensionality reduction methods. With increasing dimensionality, several supervised feature extraction methods have been proposed to produce a feature ranking, especially on microarray gene expression data. This paper proposes a method with twofold objectives: it implements a balanced supervised non-negative matrix factorization (BSNMF) to handle the class imbalance problem in supervised non-negative matrix factorization techniques, and it proposes an accurate gene ranking method based on BSNMF for microarray gene expression datasets. To the best of our knowledge, this is the first work to handle the class imbalance problem in supervised feature extraction methods. This work is part of a Human Genome project at The Children's Hospital at Westmead (TB-CHW), Australia. Our experiments indicate that the components factorized using the supervised feature extraction approach have more classification capability than the unsupervised one, but that it drastically fails in the presence of class imbalance. Our proposed method outperforms state-of-the-art methods and shows promise in overcoming this concern.
Braytee, A, Liu, W & Kennedy, P 2016, 'A Cost-Sensitive Learning Strategy for Feature Extraction from Imbalanced Data', International Conference on Neural Information Processing, Springer International Publishing, Kyoto, Japan, pp. 78-86.
In this paper, novel cost-sensitive principal component analysis (CSPCA) and cost-sensitive non-negative matrix factorization (CSNMF) methods are proposed for handling the problem of feature extraction from imbalanced data. The presence of highly imbalanced data misleads existing feature extraction techniques into producing biased features, which results in poor classification performance, especially for the minority class. To solve this problem, we propose a cost-sensitive learning strategy for feature extraction techniques that uses the imbalance ratio of classes to discount the majority samples. This strategy is adapted to popular feature extraction methods such as PCA and NMF. The main advantage of the proposed methods is that they lessen the inherent bias of the extracted features towards the majority class in existing PCA and NMF algorithms. Experiments on twelve public datasets with different levels of imbalance ratio show that the proposed methods outperform state-of-the-art methods on multiple classifiers.
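The discounting strategy can be sketched as weighting each sample by its inverse class frequency before forming the covariance matrix that PCA diagonalises. This is a simplified illustration of the idea, not the paper's exact CSPCA formulation:

```python
import numpy as np

def cost_sensitive_pca(X, y, n_components=1):
    """PCA on an inverse-class-frequency weighted covariance:
    majority samples are discounted by the imbalance ratio so
    the leading components are less biased toward the majority
    class."""
    classes, counts = np.unique(y, return_counts=True)
    w = np.empty(len(y))
    for cls, cnt in zip(classes, counts):
        w[y == cls] = 1.0 / cnt          # discount by class size
    w /= w.sum()                         # weights sum to 1
    mu = w @ X                           # weighted mean
    Xc = X - mu
    cov = (Xc * w[:, None]).T @ Xc       # weighted covariance
    vals, vecs = np.linalg.eigh(cov)     # ascending eigenvalues
    return vecs[:, ::-1][:, :n_components]

X = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, -0.1], [0.5, 5.0]])
y = np.array([0, 0, 0, 1])               # 3-to-1 imbalance
comps = cost_sensitive_pca(X, y, n_components=1)
```

With uniform weights this reduces to ordinary PCA; with inverse-frequency weights, each class contributes equally to the covariance regardless of its sample count.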
Wang, S, Liu, W, Wu, J, Cao, L, Meng, Q & Kennedy, PJ 2016, 'Training deep neural networks on imbalanced data sets', Proceedings of the International Joint Conference on Neural Networks, IEEE International Joint Conference on Neural Networks, IEEE, Vancouver, Canada, pp. 4368-4374.
© 2016 IEEE. Deep learning has become increasingly popular in both academia and industry in recent years. Various domains including pattern recognition, computer vision, and natural language processing have witnessed the great power of deep networks. However, current studies on deep learning mainly focus on data sets with balanced class labels, and its performance on imbalanced data is not well examined. Imbalanced data sets are widespread in the real world and pose great challenges for classification tasks. In this paper, we focus on the problem of classification with deep networks on imbalanced data sets. Specifically, a novel loss function called mean false error, together with its improved version mean squared false error, is proposed for training deep networks on imbalanced data sets. The proposed method can effectively capture classification errors from the majority class and minority class equally. Experiments and comparisons demonstrate the superiority of the proposed approach over conventional methods in classifying imbalanced data sets with deep neural networks.
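The mean false error idea can be illustrated numerically: instead of averaging squared errors over all samples, which lets the majority class dominate, average the per-class mean errors so each class contributes equally. A toy sketch of the loss computation only, not the paper's network training:

```python
import numpy as np

def mean_false_error(y_true, y_pred):
    """Average of per-class mean squared errors: each class
    contributes equally to the loss regardless of its size."""
    per_class = [np.mean((y_true[y_true == c] - y_pred[y_true == c]) ** 2)
                 for c in np.unique(y_true)]
    return float(np.mean(per_class))

# 8 majority samples predicted well, 2 minority predicted badly
y_true = np.array([0.0] * 8 + [1.0] * 2)
y_pred = np.array([0.1] * 8 + [0.4] * 2)

mse = float(np.mean((y_true - y_pred) ** 2))  # dominated by majority
mfe = mean_false_error(y_true, y_pred)        # exposes minority errors
```

Here the plain MSE is 0.08 while the per-class average is 0.185: the large errors on the minority class are no longer diluted by the many easy majority samples.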
Braytee, A, Gill, AQ, Kennedy, PJ & Hussain, FK 2015, 'A Review and comparison of service E-Contract Architecture Metamodels', Neural Information Processing (LNCS), International Conference on Neural Information Processing, Springer, Istanbul, Turkey, pp. 583-595.
© Springer International Publishing Switzerland 2015. An adaptive service e-contract is an electronic agreement which is required to enable adaptive or agile service sourcing and provisioning. There are a number of e-contract metamodels that can be used to create a context-specific adaptive service e-contract. The challenge is which one to choose and adopt for adaptive services. This paper presents a review and comparison of well-known e-contract metamodels using architecture theory, which allows the analysis of the e-contract metamodels through a three-dimensional analytical lens: structure, behavior and technology. The results of this paper highlight the metamodels' structural, behavioral and technological differences and similarities. This paper will help researchers and practitioners to assess whether the existing e-contract metamodels are appropriate for adaptive services or whether there is a need to merge and integrate the concepts of these metamodels to propose a new unifying adaptive service e-contract metamodel. This paper is limited to the number of compared metamodels.
Braytee, A, Hussain, F, Anaissi, A & Kennedy, PJ 2015, 'ABC-Sampling for balancing imbalanced datasets based on Artificial Bee Colony algorithm', Proceedings 2015 IEEE 14th International Conference on Machine Learning and Applications ICMLA 2015, International Conference on Machine Learning and Applications, IEEE, Miami, Florida, pp. 594-599.
Class imbalanced data is a common problem for predictive modelling in domains such as bioinformatics. It occurs when the distribution of classes among samples is not uniform, and it biases learning towards the majority classes. In this study, we propose the ABC-Sampling algorithm based on a swarm optimization method called Artificial Bee Colony, which models the natural foraging behaviour of honeybees. Our algorithm lessens the effects of imbalanced classes by selecting the most informative majority samples using a forward search and storing them in a ranked subset. We then construct a balanced dataset with a planned undersampling strategy, extracting the most frequent majority samples from the top-ranked subset and combining them with all minority samples. Our algorithm is superior to a state-of-the-art method on nine benchmark datasets with various levels of imbalance ratio.
Curiskis, SA, Osborn, TR & Kennedy, PJ 2015, 'Link prediction and topological feature importance in social networks', Proceedings of the 13-th Australasian Data Mining Conference (AusDM 2015), Sydney, Australia, Australian Data Mining Conference, Australian Computer Society, Sydney, pp. 39-50.
Meng, Q, Tafavogh, S & Kennedy, PJ 2014, 'Community detection on heterogeneous networks by multiple semantic-path clustering', 2014 6th International Conference on Computational Aspects of Social Networks, CASoN 2014, International Conference on Computational Aspects of Social Networks (CASoN), IEEE, Porto, Portugal, pp. 7-12.
© 2014 IEEE. Heterogeneous networks have become a commonly used model to represent complex and abstract social phenomena. They allow objects to have many different relationships and represent relationships by semantic paths which connect object types via a sequence of relations. A major challenge in community detection on heterogeneous networks is how to organize and combine different semantic paths. In order to acquire the desired clustering, we propose a novel community detection method for heterogeneous networks based on matrix decomposition and semantic paths. The major advantage of this method is that it treats objects individually and assigns them different combinations of semantic-path weights so as to improve clustering quality. Comparative experiments against two state-of-the-art methods, spectral clustering and path-selection clustering, confirm that the proposed method acquires better clustering results.
Tafavogh, S, Meng, Q, Catchpoole, DR & Kennedy, PJ 2014, 'Automated quantitative and qualitative analysis of whole neuroblastoma tumour images for prognosis', Proceedings of the IASTED 11th International Conference on Biomedical Engineering, IASTED International Conference on Biomedical Engineering, ACTA Press, Zurich, Switzerland, pp. 244-251.
Homayounfard, H, Kennedy, PJ & Braun, RM 2013, 'NARGES: Prediction Model for Informed Routing in a Communications Network', Advances in Knowledge Discovery and Data Mining (LNCS), Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Gold Coast, QLD, Australia, pp. 327-338.
There is a dependency between packet loss and the delay and jitter time-series derived from a telecommunication link. Multimedia applications such as Voice over IP (VoIP) are sensitive to loss, and packet recovery alone is not an efficient solution as the number of Internet users grows. Predicting packet loss from the network dynamics of past transmissions is crucial to inform the next generation of routers in making smart decisions. This paper proposes a hybrid data mining model for routing management in a communications network, called NARGES. The proposed model is designed and implemented for predicting packet loss based on forecasted delays and jitters. The model consists of two parts: a historical symbolic time-series approximation module, called HDAX, and a Multilayer Perceptron (MLP). It is validated with heterogeneous quality of service (QoS) datasets, namely delay, jitter and packet-loss time-series. The results show improved precision and quality of prediction compared to the autoregressive moving average (ARMA) model.
Meng, Q & Kennedy, PJ 2013, 'Discovering Influential Authors in Heterogeneous Academic Networks by a Co-ranking Method', Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, ACM International Conference on Information and Knowledge Management, Association for Computing Machinery, San Francisco, California, USA, pp. 1029-1036.
Research in ranking networked entities is widely applicable to many problems such as optimizing search engines, building recommendation systems and discovering influential nodes in social networks. However, many well-known ranking approaches like PageRank are limited to homogeneous networks and are not applicable to heterogeneous networks. Faced with this problem, we propose a co-ranking method to evaluate scientific publications and authors. This novel approach is a flexible framework based on a set of customized rules taking into account both the topological features of networks and the included citations. The approach ranks authors and publications iteratively and uses the results of each round to reinforce the ranks of authors and publications. Unlike traditional approaches to assessing publications, which require a great number of citations, our method lowers this requirement. This co-ranking approach has been validated using data collected from DBLP and CiteSeer, and the results suggest that it is effective and efficient in ranking authors and publications based on limited numbers of citations in heterogeneous networks, and that it converges quickly.
Tafavogh, S, Felix Navarro, KM, Catchpoole, DR & Kennedy, PJ 2013, 'Segmenting Neuroblastoma Tumor Images and Splitting Overlapping Cells Using Shortest Paths between Cell Contour Convex Regions', Lecture Notes in Computer Science, Artificial Intelligence in Medicine in Europe, Springer, Murcia, Spain, pp. 171-175.
Neuroblastoma is one of the most fatal paediatric cancers. One of the major prognostic factors for neuroblastoma tumours is the total number of neuroblastic cells. In this paper, we develop a fully automated system for counting the total number of neuroblastic cells within images derived from Hematoxylin and Eosin stained histological slides, accounting for overlapping cells. We propose a multi-stage cell counting algorithm in which cellular regions are extracted using an adaptive thresholding technique, and overlapping and single cells are discriminated using morphological differences. We also propose a novel cell splitting algorithm to split overlapping cells into single cells using the shortest path between the contours of convex regions.
Ghous, H, Kennedy, PJ, Ho, N & Catchpoole, DR 2012, 'Functional Visualisation of Genes using Singular Value Decomposition', Proceedings of the 10th Australasian Data Mining Conference, Australian Data Mining Conference, Australian Computer Society, Sydney, Australia, pp. 53-60.
Progress in understanding the core pathways and processes of cancer requires thorough analysis of many coding regions of the genome. New insights are hampered by the lack of tools to make sense of large lists of genes identified using high-throughput technology. Data mining, particularly visualisation that finds relationships between genes and the Gene Ontology (GO), has the potential to assist in functional understanding. This paper addresses the question of how well GO annotations can help in the functional understanding of genes. We augment genes with associated GO terms and visualise them with Singular Value Decomposition (SVD). The meaning of the derived components is further interpreted using correlations to GO terms. The results demonstrate that SVD visualisation of GO-augmented genes matches the biological understanding expected in the simulated data and presents an understanding of childhood cancer genes that aligns with published results.
Meng, Q & Kennedy, PJ 2012, 'Determining the number of clusters in co-authorship networks using social network theory', The 2nd International Conference on Social Computing and Its Applications, International Conference on Social Computing and Its Applications, IEEE, Xiangtan, Hunan, China, pp. 337-343.
Spectral clustering is a modern data clustering methodology with many notable advantages. However, it has a weakness in that it requires researchers to specify the number of clusters a priori, and in most cases it is a challenge to know this number accurately. Here, we propose a novel way to solve this problem using the concept of group leaders and members from social network theory. From the perspective of social networks, groups are organized by leaders, which provides a hint for finding the number of clusters in social networks by identifying group leaders. However, because a group can have more than one leader, we also propose an algorithm to combine leaders from the same group. The number of leaders after this combination is expected to be the number of clusters in a network. We validate the proposed approach by using spectral clustering to cluster the co-authorship network from the University of Technology, Sydney (UTS). The experimental results show that our proposed method is effective in determining the number of clusters and helps spectral clustering achieve better clusters than other methods of calculating the number of clusters.
Meng, Q & Kennedy, PJ 2012, 'Using field of research codes to discover research groups from co-authorship networks', IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, IEEE, Istanbul, Turkey, pp. 289-293.
Nowadays, academic collaboration has become more prevalent and crucial than ever before, and many studies of academic collaboration are based on co-authorship networks. This paper builds a novel co-authorship network by importing field of research codes into Newman's model, and then analyzes and extracts research groups via spectral clustering. To demonstrate the effectiveness of this revised network, we take academic collaboration at the University of Technology, Sydney (UTS) as an example. The results of this study advance methods for identifying the most prolific research groups and individuals in research institutions, and provide scientific evidence for policymakers to manage laboratories and research groups more efficiently in the future.
Meng, Q & Kennedy, PJ 2012, 'Using network evolution theory and singular value decomposition method to improve accuracy of link prediction in social networks', Proceedings of the Tenth Australasian Data Mining Conference (AusDM-12), Australian Data Mining Conference, Australian Computer Society, Sydney, pp. 175-181.
Link prediction in large networks, especially social networks, has received significant recent attention. Although many papers contribute methods for link prediction, the accuracy of most predictors is generally low as they treat all nodes equally. We propose an effective approach to identifying the activity level of nodes in networks by observing their behaviour during network evolution. Nodes that have been active previously contribute more to the changes in a network than stable nodes, which have low activity. We apply truncated singular value decomposition (SVD) to exclude the interference of stable nodes by treating them as noise in our dataset. Finally, to test the effectiveness of the proposed method, we use co-authorship networks from an Australian university between 2006 and 2011 as an experimental dataset. The results show that our proposed method achieves higher accuracy in link prediction than previous methods, especially in predicting new links.
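The role of truncated SVD here can be sketched on an adjacency matrix: small singular values are zeroed out as noise, and the low-rank reconstruction scores candidate links. A minimal sketch of the mechanism, using an assumed toy network rather than the paper's co-authorship data:

```python
import numpy as np

def svd_link_scores(adj, rank):
    """Low-rank reconstruction of an adjacency matrix via
    truncated SVD. Zeroing small singular values filters out
    low-variance structure; non-zero scores where adj is 0 can
    be read as evidence for candidate links."""
    U, s, Vt = np.linalg.svd(adj, full_matrices=False)
    s[rank:] = 0.0                      # drop small singular values
    return U @ np.diag(s) @ Vt

# toy co-authorship adjacency: authors 0-1 and 1-2 have collaborated
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
full = svd_link_scores(adj, rank=3)     # exact reconstruction
low = svd_link_scores(adj, rank=1)      # smoothed link scores
```

With the full rank, the reconstruction reproduces the adjacency matrix exactly; truncating the rank blends structural information across nodes, which is what produces scores for unobserved pairs.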
Tafavogh, S, Kennedy, PJ & Catchpoole, DR 2012, 'Determining Cellularity Status of Tumors based on Histopathology using Hybrid Image Segmentation', International Joint Conference on Neural Networks, IEEE International Joint Conference on Neural Networks, IEEE, Brisbane, Australia, pp. 1-8.
A Computer Aided Diagnosis (CAD) system is developed to determine the cellularity status of a tumor. The system helps pathologists to distinguish a tumor with cell proliferation from normal tumors. The developed CAD system implements a hybrid segmentation method to identify and extract the morphological features that pathologists use to determine the cellularity status of a tumor. Adaptive Mean Shift (AMS) clustering, a non-parametric technique, is integrated with Color Template Matching (CTM) to construct the segmentation approach. For comparison, we used Expectation Maximization (EM) clustering as a parametric technique. The outputs of our proposed system and of EM are validated against ground truth provided by two pathologists. The results of our developed system are close to the decisions of the pathologists, and it significantly outperforms EM in terms of accuracy.
Zhao, Y, Li, J, Christen, P & Kennedy, PJ 2012, 'Preface', Conferences in Research and Practice in Information Technology Series, p. vii.
Analysis and visualization of microarray data is very helpful for biologists and clinicians in the diagnosis and treatment of patients. It allows clinicians to better understand the structure of microarray data and facilitates understanding of gene expression in cells. However, a microarray dataset is complex, with thousands of features and a very small number of observations. This very high dimensional data often contains noise, non-useful information and only a small number of features relevant to disease or genotype. This paper proposes a non-linear dimensionality reduction algorithm, Local Principal Component (LPC), which aims to map high dimensional data to a lower dimensional space. The reduced data represents the most important variables underlying the original data. Experimental results and comparisons are presented to show the quality of the proposed algorithm. Moreover, experiments show how the algorithm reduces high dimensional data whilst preserving the neighbourhoods of points in the low dimensional space as in the high dimensional space.
Anaissi, A, Kennedy, PJ & Goyal, ML 2011, 'Feature Selection of Imbalanced Gene Expression Microarray Data', Proceedings 12th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, IEEE Computer Society, Sydney, pp. 73-78.
Gene expression data is a very complex data set characterised by an abundant number of features but a low number of observations. However, only a small number of these features are relevant to an outcome of interest, so feature selection becomes a real prerequisite. This paper proposes a methodology for feature selection for imbalanced leukaemia gene expression data based on the random forest algorithm. It presents the importance of feature selection in terms of reducing the number of features, enhancing the quality of machine learning and providing biologists with a better understanding for diagnosis and prediction. Algorithms are presented to show the methodology and strategy for feature selection, taking care to avoid overfitting. Moreover, experiments are performed on imbalanced leukaemia gene expression data and appropriate measures are used to evaluate the quality of the feature selection and the performance of classification.
Ghous, H, Ho, N, Catchpoole, DR & Kennedy, PJ 2011, 'Comparing functional visualizations of genes', The 5th International Workshop on Data Mining in Functional Genomics and Proteomics: Current Trends and Future Directions, International Workshop on Data Mining in Functional Genomics and Proteomics: Current Trends and Future Directions, European Conference on Machine Learning, Athens, Greece, pp. 12-21.
Santosa, H, Milton, J & Kennedy, PJ 2011, 'HMXT-GP: an information-theoretic approach to genetic programming that maintains diversity', SAC 2011: Proceedings of the 26th Annual ACM Symposium on Applied Computing 2011, ACM Symposium on Applied Computing, Association for Computing Machinery, Taichung, Taiwan, pp. 1070-1075.
This paper applies a recent information-theoretic approach to controlling Genetic Algorithms (GAs), called HMXT, to tree-based Genetic Programming (GP). HMXT, in a GA domain, requires the setting of selection thresholds in a population and the application of high levels of crossover to thoroughly mix alleles. Applying these in a tree-based GP setting is not trivial. We present results comparing HMXT-GP to Koza-style GP for varying amounts of crossover and over three different optimisation (minimisation) problems. Results show that average fitness is better with HMXT-GP because it maintains more diversity in populations, but that the minimum fitness found was better with Koza-style GP. HMXT allows straightforward tuning of population diversity and selection pressure by altering the position of the selection thresholds.
Vamplew, P, Stranieri, A, Ong, K, Christen, P & Kennedy, PJ 2011, 'Data Mining and Analytics 2011 (AusDM'11)', Proceedings of the Ninth Australasian Data Mining Conference (AusDM'11), Ninth Australasian Data Mining Conference, Australian Computer Society, Ballarat, Australia, pp. i-229.
We are delighted to welcome you to the Ninth Australasian Data Mining Conference (AusDM'11) being held this year in Ballarat, Victoria. AusDM started in 2002 and is now the annual flagship meeting for data mining and analytics professionals in Australia. Both scholars and practitioners present the state-of-the-art in the field. Endorsed by the peak professional body, the Institute of Analytics Professionals of Australia, AusDM has developed a unique profile in nurturing this joint community. The conference series has grown in size each year from early events held in Canberra (2002, 2003), Cairns (2004), Sydney (2005, 2006), the Gold Coast (2007), Glenelg (2008) and Melbourne (2009).
Van, A, Gay, VC, Kennedy, PJ, Barin, E & Leijdekkers, P 2011, 'Understanding risk factors in cardiac rehabilitation patients with random forests and decision trees', Proceedings of the Ninth Australasian Data Mining Conference (AusDM'11), Australian Data Mining Conference, Australian Computer Society, Ballarat, Australia, pp. 11-22.
Anaissi, A, Kennedy, PJ & Goyal, ML 2010, 'A framework for high dimensional data reduction in the microarray domain', Proceedings 2010 IEEE Fifth Conference on Bio-Inspired Computing: Theories and Applications, IEEE International Conference on Bio-Inspired Computing: Theories and Applications, IEEE, Changsha, China, pp. 903-907.
Microarray analysis and visualization is very helpful for biologists and clinicians to understand gene expression in cells and to facilitate the diagnosis and treatment of patients. However, a typical microarray dataset has thousands of features and a very small number of observations. This very high dimensional data carries a massive amount of information but often contains noise, non-useful information and only a small number of features relevant to disease or genotype. This paper proposes a framework for very high dimensional data reduction based on three technologies: feature selection, linear dimensionality reduction and non-linear dimensionality reduction. Feature selection based on mutual information is proposed for filtering features and selecting the most relevant features with minimum redundancy. A kernel linear dimensionality reduction method is also used to extract latent variables from a high dimensional data set. In addition, non-linear dimensionality reduction based on locally linear embedding is used to reduce the dimension and visualize the data. Experimental results are presented to show the outputs of each step and the efficiency of this framework.
Milton, J & Kennedy, PJ 2010, 'Entropy Profiles of Ranked and Random Populations', Proceedings of the 12th Annual Genetic and Evolutionary Computation Conference (GECCO-2010), Annual Genetic and Evolutionary Computation Conference, ACM Inc., Portland, Oregon, USA, pp. 1843-1850.
The paper describes the concept of entropy profile, how it is derived, its relationship to the number partition problem and to the information extracted from an objective function. It is hoped that discussion and criticism of this idea may shed light on why some problem representations are NP hard and other, very similar problems are relatively simple. Entropy profiles illustrate the difference between ranked and randomly ordered populations of individuals in a GA in a way which quantifies the information extracted from the objective function by the ranking process. The entropy profiles of random populations are shown to arise from the fact that there are many more of such 'paths' through the entropy coordinate space than periodic or ranked paths. Additionally, the entropy profile provides a measurable difference between periodic low-frequency sequences, periodic high-frequency sequences, random sequences and those which are in some way structured, i.e. by an objective function or other signal. The entropy coordinate space provides a visualisation and explanation of why these profiles differ and perhaps, by way of the integer partition phase transition, also a means to understand why some problems are hard while other seemingly similar problems are straightforward to solve.
Wu, Y & Wang, G 2010, 'Papers based on presentations at the 5th International Symposium on Novel Materials and Their Synthesis (NMS-V) and the 19th International Symposium on Fine Chemistry and Functional Polymers (FCFP-XIX), 18-22 October 2009, Shanghai, China: Preface', Pure and Applied Chemistry, International Union of Pure and Applied Chemistry, pp. IV-IV.
Homayounfard, H & Kennedy, PJ 2009, 'HDAX: Historical Symbolic Modelling of Delay Time Series in a Communications Network', Data Mining and Analytics 2009 (AusDM'09): Conference in Research and Practice in Information Technology Volume 101, Australian Data Mining Conference, Australian Computer Society, Melbourne, Australia, pp. 129-137.
Certain performance parameters, such as packet delay, delay variation (jitter) and loss, are decision factors for online quality of service (QoS) traffic routing. Although considerable effort has been devoted to assuring QoS on the Internet, the dominant TCP/IP best-effort communications policy does not provide sufficient guarantees without abrupt changes to the protocols. Estimating and forecasting end-to-end delay and its variations are essential tasks in network routing management for detecting anomalies. A large amount of research has been done to provide foreknowledge of network anomalies by characterizing and forecasting delay with numerical forecasting methods.
Kennedy, PJ, Ong, K & Christen, P 2009, 'Data Mining and Analytics', Data Mining and Analytics 2009 (AusDM'09), Australian Data Mining Conference, Australian Computer Society, Melbourne, Australia, pp. 1-218.
Kennedy, PJ, Ong, KL & Christen, P 2009, 'Preface', Conferences in Research and Practice in Information Technology Series.
Aloqaily, A, Kennedy, PJ, Catchpoole, DR & Simoff, SJ 2008, 'Comparison of visualization methods of genome-wide SNP profiles in childhood acute lymphoblastic leukemia', Data Mining and Analytics 2008: Proceedings of the Seventh Australasian Data Mining Conference (AusDM 2008), Conferences in Research and Practice in IT (CRPIT), Vol. 87, Australian Data Mining Conference, Australian Computer Society, Adelaide, Australia, pp. 111-121.
Data mining and knowledge discovery have been applied to datasets in various industries, including biomedical data. Modelling, data mining and visualization of biomedical data address the problem of extracting knowledge from large and complex biomedical datasets. The current challenge in dealing with such data is to develop statistical and data mining methods that search and browse the underlying patterns within the data. In this paper, we employ several state-of-the-art data reduction methods for visualizing genome-wide Single Nucleotide Polymorphism (SNP) datasets. The visualization approach was selected based on the trustworthiness of the resultant visualizations. To deal with large amounts of genetic variation data, we apply different data reduction methods to address the problem induced by high dimensionality. Based on the trustworthiness metric, we found that the Neighbour Retrieval Visualizer (NeRV), which optimizes the retrieval quality of Stochastic Neighbour Embedding, outperformed the other methods, showing excellent results even though the dataset was reduced from 13917 to 2 dimensions. The visualization results will assist clinicians and biomedical researchers in understanding the systems biology of patients and in comparing different groups of clusters in visualizations.
Ghous, H, Kennedy, PJ, Catchpoole, DR & Simoff, SJ 2008, 'Kernel-based visualisation of genes with the Gene Ontology', Data Mining and Analytics 2008: proceedings of the Seventh Australasian Data Mining Conference (AusDM 2008), Conferences in Research and Practice in IT (CRPIT), Vol. 87, Australian Data Mining Conference, Australian Computer Society, Adelaide, pp. 133-140.
Rahman, AM, Kennedy, PJ, Simmonds, AJ & Edwards, J 2008, 'Fuzzy logic based modelling and analysis of network traffic', 8th IEEE International Conference on Computer and Information Technology 2008. CIT 2008, IEEE International Conference on Computer and Information Technology, IEEE, Sydney, Australia, pp. 652-657.
Accurate computer network traffic models are required for many network tasks, such as traffic analysis and performance optimization. Existing statistical traffic modelling techniques rely on precise mathematical analysis of extensive measured data such as packet arrival time, packet size and server-side or client-side round trip time. With the advent of high-speed broadband networks, gathering an acceptable quantity of data for the precise representation of traffic is a difficult, time-consuming, expensive and in some cases almost impossible task. In this work we develop fuzzy logic based traffic models using imprecise data sets that can be obtained realistically. The models include a parameter, the R parameter, which is also useful for the analysis of network traffic.
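The core fuzzy-logic idea behind this kind of model can be sketched generically. The snippet below is a hypothetical illustration only (it is not the paper's model and does not compute its R parameter): it fuzzifies an imprecise round-trip-time measurement into overlapping traffic classes using triangular membership functions.

```python
def triangular(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical fuzzy sets over round-trip time in milliseconds.
sets = {
    "low":    (0.0, 10.0, 50.0),
    "medium": (30.0, 80.0, 150.0),
    "high":   (120.0, 200.0, 400.0),
}

rtt = 90.0  # a single imprecise round-trip-time measurement
memberships = {name: triangular(rtt, a, b, c) for name, (a, b, c) in sets.items()}
dominant = max(memberships, key=memberships.get)
print(dominant)  # medium
```

The appeal of this representation, as the abstract notes, is that it tolerates imprecise measurements: a reading near a set boundary simply belongs partially to two classes rather than forcing a hard threshold.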
Roddick, J, Li, J, Christen, P & Kennedy, PJ 2008, 'Data Mining & Analytics 2008: Proceedings of the 7th Australasian Data Mining Conference (AusDM 2008)', Data Mining & Analytics 2008: Proceedings of the 7th Australasian Data Mining Conference (AusDM 2008), Australian Data Mining Conference, Australian Computer Society, Adelaide.
Roddick, JF, Li, J, Christen, P & Kennedy, P 2008, 'Preface', Conferences in Research and Practice in Information Technology Series.
Christen, P, Gao, J, Kennedy, PJ, Li, J, Li, W, Kolyshkina, I, Ong, K & Williams, G 2007, 'Data Mining, Artificial Intelligence & Analytics 2007: Proceedings of the 6th Australasian Data Mining Conference (AusDM 2007) and the 2nd International Workshop on Integrating AI and Data Mining (AIDM 2007)', Data Mining, Artificial Intelligence & Analytics 2007: Proceedings of the 6th Australasian Data Mining Conference (AusDM 2007) and the 2nd International Workshop on Integrating AI and Data Mining (AIDM 2007), Australian Data Mining Conference, Australian Computer Society, Gold Coast, Australia.
Christen, P, Kennedy, PJ, Li, J, Nayak, R, Simoff, SJ & Williams, GJ 2007, 'Preface', Conferences in Research and Practice in Information Technology Series.
Aloqaily, A & Kennedy, PJ 2006, 'Using a kernel-based approach to visualize integrated chronic fatigue syndrome datasets', Proceedings of the Fifth Australasian Data Mining Conference (AusDM'06), Australian Data Mining Conference, ACS, Sydney, Australia, pp. 53-61.
Beauregard, M & Kennedy, PJ 2006, 'Robust simulation of lamprey tracking', Parallel Problem Solving from Nature - PPSN, Parallel Problem Solving from Nature, Springer, Reykjavik, Iceland, pp. 641-650.
Biologically realistic computer simulation of vertebrates is a challenging problem with exciting applications in computer graphics and robotics. Once the mechanics of locomotion are available, it is interesting to mediate this locomotion with higher-level behavior such as target tracking. One recent approach simulates a relatively simple vertebrate, the lamprey, using recurrent neural networks to model the central pattern generator of the spine and a physical model for the body. Target tracking behavior has also been implemented for such a model. However, previous approaches suffer from deficiencies where particular orientations of the body to the target cause the central pattern generator to shut down. This paper describes an approach to making target tracking more robust.
Beauregard, M, Kennedy, PJ & Debenham, JK 2006, 'Fast simulation of animal locomotion: lamprey swimming', Professional Practice In Artificial Intelligence, World Computer Congress, Springer, Santiago, Chile, pp. 247-256.
Biologically realistic computer simulation of vertebrate locomotion is an interesting and challenging problem with applications in computer graphics and robotics. One current approach simulates a relatively simple vertebrate, the lamprey, using recurrent neural networks to model the central pattern generator of the spine and a physical model for the body.
Christen, P, Kennedy, PJ, Li, J, Simoff, SJ & Williams, G 2006, 'Data Mining 2006: Proceedings of the Australasian Data Mining Conference (AusDM 2006)', Data Mining 2006: Proceedings of the Australasian Data Mining Conference (AusDM 2006), Australian Data Mining Conference, Australian Computer Society, Sydney.
Christen, P, Kennedy, PJ, Li, J, Simoff, SJ & Williams, GJ 2006, 'Preface', Conferences in Research and Practice in Information Technology Series.
Luu, J & Kennedy, PJ 2006, 'Investigating the size and value effect in determining performance of Australian listed companies: a neural network approach', Proceedings of the 5th Australasian Data Mining Conference, Australian Data Mining Conference, ACS, Sydney, Australia, pp. 155-161.
Milton, J, Kennedy, PJ & Mitchell, H 2005, 'The effect of mutation on the accumulation of information in a genetic algorithm', 18th Australian Joint Conference on Artificial Intelligence 2005 Proceedings, Australasian Joint Conference on Artificial Intelligence, Springer, Sydney, Australia, pp. 360-368.
We use an information theory approach to investigate the role of mutation in Genetic Algorithms (GAs). The concept of solution alleles representing information in the GA, and the associated concept of information density, being the average frequency of solution alleles in the population, are introduced. Using these concepts, we show that mutation applied indiscriminately across the population has, on average, a detrimental effect on the accumulation of solution alleles within the population and hence on the construction of the solution. Mutation is shown to reliably promote the accumulation of solution alleles only when it is targeted at individuals with a lower information density than the mutation source. When such individuals are targeted for mutation, very high rates of mutation can be used. This significantly increases the diversity of alleles present in the population, while also increasing the average occurrence of solution alleles.
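The targeting rule described in this abstract can be sketched in a few lines. This is a toy illustration under an assumed encoding (the all-ones string is the solution, so information density is simply an individual's fraction of ones); it is not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
pop = rng.integers(0, 2, size=(20, 30))  # 20 binary individuals, 30 loci each

def information_density(bits):
    """Average frequency of solution alleles; here, ones (assumed encoding)."""
    return bits.mean()

source_density = 0.5  # an unbiased mutation source emits solution alleles half the time
rate = 0.3            # a deliberately high per-locus mutation rate

# Mutate only individuals whose information density is below the source's,
# the condition under which the paper argues mutation is reliably beneficial.
for i in range(pop.shape[0]):
    if information_density(pop[i]) < source_density:
        mask = rng.random(pop.shape[1]) < rate
        pop[i, mask] = rng.integers(0, 2, size=int(mask.sum()))
```

Because mutated loci are redrawn from a source whose density exceeds the targeted individual's, each targeted individual's expected density rises, which is why the targeted scheme tolerates such high mutation rates.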
Kennedy, PJ, Simoff, SJ, Skillicorn, D & Catchpoole, DR 2004, 'Extracting and explaining biological knowledge in microarray data', Advances In Knowledge Discovery And Data Mining, Proceedings, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer-Verlag Berlin, Sydney, Australia, pp. 699-703.
This paper describes a method of clustering lists of genes mined from a microarray dataset using functional information from the Gene Ontology. The method uses relationships between terms in the ontology both to build clusters and to extract meaningful cluster descriptions. The approach is general and may be applied to assist explanation of other datasets associated with ontologies.
Skillicorn, D, Simoff, SJ, Kennedy, PJ & Catchpoole, DR 2004, 'Strategies for Winnowing Microarray Data', Proceedings SIAM Bioinformatics Workshop, SIAM International Conference on Data Mining, Uppsala University, Lake Buena Vista, Florida, USA, pp. 45-51.
Whitehead, D, Skusa, A & Kennedy, PJ 2004, 'Evaluating an Evolutionary Approach for Reconstructing Gene Regulatory Networks', Artificial Life IX Proceedings of the Ninth International Conference on the Simulation and Synthesis of Artificial Life, International Conference on the Simulation and Synthesis of Living Systems, The MIT Press, Boston, USA, pp. 427-432.
Kennedy, PJ & Simoff, SJ 2003, 'CONGO: Clustering on the Gene Ontology', Congress on Evolutionary Computation. Proceedings of the 2nd Australasian Data Mining Workshop, IEEE Congress on Evolutionary Computation, University of Technology, Sydney, Canberra, Australia, pp. 181-198.
With the invention of microarray technology, researchers are capable of measuring the expression levels of tens of thousands of genes in parallel at various time points of a biological process. During the investigation of gene regulatory networks and general cellular mechanisms, biologists attempt to group genes based on the time-dependent patterns of the obtained expression levels. In this paper, we propose a new memetic algorithm, a genetic algorithm combined with local search, based on a tree representation of the data (a minimum spanning tree) for clustering gene expression data. The combination of both concepts is shown to find near-optimal solutions quickly. Due to the minimum spanning tree representation of the data, our algorithm is capable of finding clusters of different shapes. We show that our approach is superior in solution quality compared to classical clustering methods.
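The minimum-spanning-tree representation underlying this abstract can be sketched with SciPy. The snippet below is not the paper's memetic algorithm: it shows only the basic MST idea it builds on, namely that removing the longest MST edges partitions the points into arbitrarily shaped clusters.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
# Two well-separated synthetic groups standing in for expression profiles.
pts = np.vstack([rng.normal(0, 0.3, (15, 5)), rng.normal(5, 0.3, (15, 5))])

dist = squareform(pdist(pts))                # pairwise Euclidean distances
mst = minimum_spanning_tree(dist).toarray()  # MST as a dense edge-weight matrix

# Cut the single longest MST edge to split the tree into 2 clusters.
mst[mst == mst.max()] = 0.0
n_clusters, labels = connected_components(mst, directed=False)
print(n_clusters)  # 2
```

A local-search or genetic component, as in the paper, would instead search over which edges to cut so as to optimise a cluster-quality objective rather than always cutting the longest ones.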
Zhang, C, Yan, X, Zhang, S & Kennedy, PJ 2002, 'Mining Very Large Databases Using Software Agents', Proceedings of the International Conference on Machine Learning and Application (ICMLA 02), International Conference on Machine Learning and Application (ICMLA 02), CSREA Press, Las Vegas, USA, pp. 84-90.
Kennedy, PJ & Osborn, T 2001, 'A double-stranded Encoding Scheme with inversion operator for Genetic Algorithms', Proceedings of Genetic and Evolutionary Computation Conference, Genetic and Evolutionary Computation Conference, Morgan Kaufmann, San Francisco, USA, pp. 398-407.
Kennedy, PJ & Osborn, T 2000, 'Evolution of Adaptive Behaviour in a Simulated Single-Celled Organism', SAB 2000 Proceedings Supplement Book; Sixth International Conference on Simulation of Adaptive Behaviour: From Animals to Animats, The International Society for Adaptive Behavior, Paris, France, pp. 225-234.
- The Children's Hospital at Westmead
- Queen's University, Canada
- Western Sydney University