Machine Learning/Data Mining for Cancer Genomics
Bernard Manderick, Vrije Universiteit Brussel
Henry Nyongesa, University of the Western Cape
Collaboration:
Artificial Intelligence Laboratory, VUB
Intelligent Systems Laboratory, UWC
South African National Bioinformatics Institute (SANBI)
Interuniversity Institute of Bioinformatics in Brussels, (IB)²
Project Outline
Machine Learning (ML) is a rapidly growing field of research, both in terms of new techniques and of applications. The most exciting aspect of ML is the No Free Lunch principle, which states that no single ML algorithm is optimal on all types of problems; hence one cannot know a priori whether any one technique is the most suitable for a particular problem. In this research we focus on using ML to mine large genomic data sets in order to classify human tumours.
Big Data and the Curse of Dimensionality
The number of different types of datasets in the public domain continues to grow exponentially. Academic computing research is currently developing so-called "big data" solutions to make sense of the vast datasets coming out of research in other disciplines, including genomics and bioinformatics. Such data sets are large-scale and highly multi-dimensional, and are not easily amenable to traditional data analysis tools. ML techniques are well suited to automated knowledge discovery from large, complex data sets.
Data Mining and Cancer Genomics
Data mining addresses the problem of discovering patterns, regularities and structure within data collections. For this reason, the field is also referred to as knowledge discovery in databases (KDD). The discovered knowledge can then be applied to make predictions on similar datasets, to suggest explanations of dependencies between variables, or more generally to improve decision making.
Cancer is becoming increasingly common in African populations. Gene expression profiling can be used to distinguish between known cancer sub-types, and to discover new types that may have remained unknown to pathologists.
Research Collaboration between VUB and UWC
Capacity building in competences for advanced research and scholarship: VUB investigators will offer support, expertise and training to UWC staff and students.
Collaborative research into novel machine learning and data mining techniques: next-generation data mining techniques will require new machine learning algorithms, and new methods for information storage and retrieval, feature selection and its optimization, and optimization of decision making.
International cooperation through staff and student exchange visits: research and educational material developed by either group will be made available and accessible to the collaborating groups.
Long-Term Objectives
Establishment of a Centre for Machine Learning and Data Mining Applications at UWC.
Human capacity development in the field of machine learning and data mining.
Recognition of South African science and technology through international cooperation and collaboration.
Dissemination of South African research output in international workshops and conferences.
Significance of Research
This research aims to answer basic questions in cancer genomics research:
What is the optimal and most relevant representation of genomic data?
How can a patient be diagnosed based on their gene expression profile and on the knowledge gained from previously assayed patients?
How can genome-wide analytic tools best be integrated with the large and rapidly increasing number of genome-wide datasets?
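To make the second question concrete, the following is a minimal sketch of diagnosing a new patient from previously assayed patients. It assumes a nearest-centroid classifier, which is only one simple option and not necessarily the method this project will adopt. The function names, the three-gene profiles and the sub-type labels are all made up for illustration; real expression profiles contain thousands of genes.

```python
import math
from collections import defaultdict


def fit_centroids(profiles, labels):
    """Average the expression profiles of each tumour class (hypothetical helper)."""
    grouped = defaultdict(list)
    for profile, label in zip(profiles, labels):
        grouped[label].append(profile)
    # zip(*members) iterates gene by gene across all patients in a class.
    return {
        label: [sum(values) / len(values) for values in zip(*members)]
        for label, members in grouped.items()
    }


def predict(centroids, profile):
    """Return the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], profile))


# Made-up expression levels for three hypothetical genes (illustration only).
train_profiles = [
    [5.1, 0.9, 2.0],
    [4.8, 1.1, 2.2],
    [1.0, 4.9, 2.1],
    [1.2, 5.2, 1.9],
]
train_labels = ["subtype_A", "subtype_A", "subtype_B", "subtype_B"]

centroids = fit_centroids(train_profiles, train_labels)
new_patient = [4.5, 1.4, 2.0]
predicted = predict(centroids, new_patient)
print(predicted)
```

The "knowledge gained from previously assayed patients" is reduced here to one average profile per sub-type; a new patient is assigned to the sub-type with the nearest average. With realistic data, feature selection would normally come first, since most genes carry no information about the tumour class.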
Mode of Collaboration
Staff and student exchange visits, and co-supervision of research students.
Joint authorship, publication and dissemination of research papers.
A research conference/workshop hosted at UWC in each year of the project, jointly organised by the partners, to which identified international experts and other scientists in the field shall be invited.
Project decisions made jointly in a democratic fashion, with the maximum amount of information available, after discussions at face-to-face meetings or by email.
An online collaboration platform developed and established within the project to document data, code development and other information related to the project.
Work Packages and Timelines
WP1: Use Distributed/High Performance Computing for large-scale genome analysis.
WP2: Visits to enhance collaboration between partners.
WP3: Africa Workshops on Artificial Intelligence, Machine Learning, Data Mining and Bioinformatics.
Establishment of UWC Centre for Artificial Intelligence and Data Mining.
Collaboration
Professor Alan Christoffels, Director, SANBI, UWC
Professor Ann Nowe, VUB/(IB)² (Machine Learning/Bioinformatics)
Professor Tom Lenaerts, VUB/(IB)² (Bioinformatics)