|Warren Kaplan||Dmitry Degrave||Derrick Lin|
|Manuel Sopena-Ballesteros||Tansel Ersavas|
The KCCG Informatics Program aims to organize the world’s genome information and make it universally accessible and useful to authorised users.
Every precision medicine program relies on genomic sequencing of disease cohorts. Analytical insights from these cohorts lead to determining pathogenicity of genetic variants, contribute to known Genotype-Phenotype correlations and inform patient diagnoses.
Visions of High Definition Medicine (Torkamani et al. 2017), and explicit calls to realise them (Topol 2015), inspire our group to contribute to this future by building the platform to support precision medicine programs at any scale.
Our mission, stated above, is inspired by Google’s mission statement and adapted to precision medicine.
To enable easy interrogation of genome cohorts of any size, we have built the Vectis variant atlas platform. The platform was developed to suit the needs of diverse users, including clinicians, patients, scientists and bioinformaticians. Its features include:
- Search: Query specific chromosome coordinates, gene names and annotations in a given cohort.
- Beacon: Locate specific variants in different studies across the Global Alliance for Genomics and Health Beacon Network.
- Explore: Highly interactive real-time exploration of cohort summary statistics of genetic variants, including variant type, average allele frequencies, and reference and alternate alleles. Supports the querying of 40 million variants in real time.
- Interactive graphics: Including lollipop plots of allelic frequencies and gene transcripts.
- Secure login: Two-factor authentication with user-defined username and password or Google ID.
- Integrated web notebooks: Enabling bioinformaticians to run their own scripts and analyses in situ, while preserving their code, figures and results.
- Variant annotations: Including links out to the original supporting evidence.
- Clinical filtering: Subset patients based on clinical attributes and query specific genotypes at the individual level.
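The Search and Explore features above boil down to two operations: selecting variants by coordinate range and summarising them across a cohort. The following is a minimal sketch of those two operations, not the Vectis implementation. The `Variant` record, its field names, the toy cohort and both helper functions are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical cohort-level variant record (not the Vectis schema)."""
    chrom: str
    pos: int               # 1-based position on the chromosome
    ref: str               # reference allele
    alt: str               # alternate allele
    alt_allele_count: int  # alternate alleles observed in the cohort
    total_alleles: int     # called alleles (2 per diploid sample)

    @property
    def allele_frequency(self):
        # Guard against variants with no called genotypes.
        if self.total_alleles == 0:
            return 0.0
        return self.alt_allele_count / self.total_alleles


def query_region(variants, chrom, start, end):
    """Return variants on `chrom` with start <= pos <= end (inclusive)."""
    return [v for v in variants if v.chrom == chrom and start <= v.pos <= end]


def mean_allele_frequency(variants):
    """Average alternate allele frequency over a list of variants."""
    if not variants:
        return 0.0
    return sum(v.allele_frequency for v in variants) / len(variants)


# Toy cohort of 100 diploid samples (200 alleles); the values are made up.
cohort = [
    Variant("1", 1000, "A", "G", 5, 200),
    Variant("1", 1500, "C", "T", 20, 200),
    Variant("2", 1200, "G", "A", 1, 200),
]

hits = query_region(cohort, "1", 900, 1600)
print(len(hits), mean_allele_frequency(hits))
```

A linear scan like this is only suitable for illustration. Querying tens of millions of variants in real time would need an indexed or distributed store.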
GA4GH Beacon Network
Web notebooks for bioinformatics researchers
Deep Learning Initiative
The Deep Learning Initiative is a group of projects that aims to transition Garvan into the coming 'Age of Artificial Intelligence'. The initiative covers projects that demonstrate the abilities of deep learning and other soft computing techniques in the biological and medical sciences, and it organises introductory to advanced talks, seminars and workshops to popularise the use of deep learning at Garvan. The initiative was started and is currently led by Tansel Ersavas under the supervision of Dr. Warren Kaplan. Its leading project is MitoWisdom, which uses advanced deep learning techniques.
MitoWisdom: an unsupervised mitochondrial genome analyser using deep learning
Mitochondria are critical to cell survival, acting as the host cell’s energy source and regulating cell metabolism. The role of mitochondria in cancers, degenerative diseases and ageing is increasing in prominence, and better analytic tools are required to further identify their contributions to such conditions. We are developing a clustering mechanism that uses a novel deep learning system and unsupervised learning to extract features from mitochondrial genome data at multiple dimensions. This system can then be quickly and easily re-trained to analyse mitochondria in multiple ways, using minimal sample data, for specialised classification of any condition or trait. We use a “convolutional autoencoder” to reduce the dimensionality of the data, and use the reducer (encoder) part of the autoencoder as the basis of a trained deep learning system. The resulting encoder represents mitochondria and can be used as a knowledge source that can be applied to any mitochondria-related problem with minimal supervised training. The technique we use for the mitochondrial genome is general and is applicable to the whole genome or any selected portions of it. This project is currently being implemented by Tansel Ersavas with data supplied by Dr. Mark Pinese, in consultation with Prof. Aleksandra Filipovska of the University of Western Australia.
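To make the dimensionality-reduction step concrete, the sketch below shows only the forward pass of a single strided 1D convolutional layer, the kind of building block an encoder is made of. It is not the MitoWisdom code. The sequence is a toy string, the filter weights are random and untrained, and the names `one_hot`, `conv1d_encode` and `make_kernels` are hypothetical. A real convolutional autoencoder would pair the encoder with a decoder and train both to minimise reconstruction error, typically in a deep learning framework.

```python
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 channels; unknown bases (e.g. N) are all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def make_kernels(n_filters, width, seed=0):
    """Random, untrained filters: n_filters x width x 4 weights (illustration only)."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in BASES] for _ in range(width)]
        for _ in range(n_filters)
    ]


def conv1d_encode(x, kernels, stride):
    """Strided 1D convolution followed by ReLU.

    x:       list of positions, each a list of 4 channel values
    kernels: list of filters, each `width` rows of 4 weights
    Returns one feature vector (len(kernels) values) per window.
    """
    width = len(kernels[0])
    encoded = []
    for start in range(0, len(x) - width + 1, stride):
        window = x[start:start + width]
        features = []
        for kernel in kernels:
            total = sum(
                w * v
                for k_row, x_row in zip(kernel, window)
                for w, v in zip(k_row, x_row)
            )
            features.append(max(0.0, total))  # ReLU activation
        encoded.append(features)
    return encoded


# Toy 32-base sequence, not real mitochondrial data.
sequence = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
kernels = make_kernels(n_filters=2, width=4)
encoded = conv1d_encode(one_hot(sequence), kernels, stride=4)

# Input has 32 positions x 4 channels; output has 8 windows x 2 features.
print(len(one_hot(sequence)) * 4, "->", len(encoded) * len(kernels))  # 128 -> 16
```

Once such an encoder has been trained without labels, its compact output can serve as the input to a small supervised classifier, which is why only minimal labelled data is needed for each new condition or trait.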
Data Intensive Computer Engineering (DICE)
The increasingly rapid turn-around and plummeting costs of genome sequencing mean that most of the expense associated with genomes will lie not in their sequencing but in their analysis, and in the scale-out computing systems needed to perform it. The disruptive change to computing brought about by commercial cloud providers such as Amazon, Google and Microsoft offers great potential and opportunities for genomics and medicine, but requires a deep understanding of the nuances of cloud usage. The Garvan Data Intensive Computer Engineering (DICE) Group was established to design solutions to meet these challenges.
The Data Intensive Computer Engineering (DICE) group is part of the Garvan Institute of Medical Research in Sydney, Australia. DICE is a provider of innovative computing solutions for genomics data. DICE supports Garvan’s factory-scale accredited genome sequencing operation, the single cell studies of the Garvan Weizmann Centre for Cellular Genomics, and other big genome data solutions.
Currently the DICE infrastructure supports:
- HPC computation for Garvan’s 80+ bioinformaticians (in all 5 Garvan Divisions)
- Genome.One’s production bioinformatics
- Kinghorn Centre for Clinical Genomics
DICE comprises engineers Derrick Lin and Manuel Sopena Ballesteros, who report to Warren Kaplan (KCCG Informatics leader, Garvan Chief of Informatics). DICE’s computing infrastructure is supported by over $2 million in grants secured by the DICE team.
Driven by the scale and economic model of a specific problem, DICE builds solutions to run on local infrastructure, supercomputing facilities and commercial cloud environments.
Our solutions extend from hardware and networking, through software infrastructure layers (such as Apache Spark and Hadoop), to bioinformatics applications. We do not limit our solutions to local infrastructure; our designs also incorporate supercomputing facilities, commercial clouds and fast Wide Area Networks.
While focussing primarily on genomics data, DICE has close working relationships with other niche markets, including Finance, Agriculture and Defence. DICE also works closely with other research institutes that look to emulate its role, and acts as an expert solution provider for bespoke genome data and computation challenges.
Since 2010 DICE has built customised solutions for the Garvan Institute that include:
- Bioinformatics analysis environments using GenePattern and Galaxy
- DICE Wolfpack Cluster
- The computing infrastructure for Genome.One’s accredited whole genome sequencing operation
- The building of a Science Demilitarised Zone (DMZ) for the safe transfer of data in partnership with UNSW Sydney IT.
- Successful writing of over $2 million in grants to build our in-house infrastructure
- Regular invitations to Big Data Conferences
The DICE Approach
From a technical perspective, DICE works with a diverse range of technologies that include:
- HPC (Rocks Cluster)
- Panasas storage
- Mellanox Networking
- OpenStack Cloud
- Ceph storage
- Apache Spark
- Hadoop (HBase)
- Apache Kudu
The first four of these technologies (HPC, Panasas storage, Mellanox networking and OpenStack cloud) are used in production, while the others are used to develop new products in collaboration with Garvan researchers. This approach is very similar to Google’s approach to research (Spector et al. 2012). An important aspect of DICE’s business is change, with regular reconfigurations needed across the computing stack to support it.
Product development and applications
We have deployed our Genome Analysis Platform in the following production environments:
Garvan Breast Cancer Program
Alex Swarbrick, Tumour Progression Group
Chris Ormondy, Genomic Cancer Medicine Program
Garvan-GWCCG Single Cell
- Australian Genomics
- Sydney Genomics Collaborative
- National Computational Infrastructure (NCI), Canberra
- Genomics England
- Vodafone Foundation
- Computer Science and Engineering, UNSW
- Centre for Pattern Recognition and Data Analytics (PRaDA)