I am Chief of Informatics. After completing a structural biology PhD at Wits University in 1998, I joined a bioinformatics startup company before moving to Garvan’s newly established Peter Wills Bioinformatics Centre in 2002. As bioinformatics head of the Centre, my role included collaborative research and the provision of core services and infrastructure through grant funding.
I'm a member of the Data Intensive Computer Engineering (DICE) group and oversee Garvan's High Performance Computing, OpenStack Cloud and Big Data (Hadoop and Spark) infrastructure, as well as partnerships with supercomputing and commercial cloud providers. I'm particularly interested in this area because each day I live with the consequences of this figure, which shows the decline in the cost of genome sequencing over time (source: NHGRI).
Although the figure is overused, people rarely mention its consequences:
- When the first Human Genome was sequenced, the world shared this single treasured resource. Today, at Garvan alone, we have sequenced over 10,000 Whole Human Genomes, so even though a single genome is now cheap, we have a lot of them to compute on and to look after.
- While the term genome goes back to the 1920s, the size of the underlying data has changed dramatically as we went from the (first) Human Genome to the 1000 Genomes Project (7x coverage and 100-nucleotide read lengths) to an Illumina X-Ten genome (30x coverage and 150-nucleotide reads, but often up to 50x).
- Since around 2007, when the price of genome sequencing started to deviate from Moore's Law, the cost of computing a single genome has effectively increased relative to the cost of sequencing it, and it continues to do so. This is not really a problem when you're only analysing a few genomes, but at Garvan's scale of more than 1200 genomes per month, finding ways to compute more efficiently is key to our success. If this trend continues, our biggest expense in doing genomics will be the computing, not the sequencing.
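As a rough illustration of how coverage and read length drive the amount of data per genome, the sketch below uses the standard relation coverage = reads x read length / genome length. The genome length of about 3.1 billion bases and the function names are assumptions made for illustration; they are not figures or code from Garvan's pipelines.

```python
# Approximate haploid human genome length in bases (assumed round figure).
HUMAN_GENOME_BASES = 3.1e9


def sequenced_bases(coverage, genome_bases=HUMAN_GENOME_BASES):
    """Total bases sequenced for a genome at the given average coverage."""
    return coverage * genome_bases


def reads_needed(coverage, read_length, genome_bases=HUMAN_GENOME_BASES):
    """Number of reads N such that N * read_length / genome_bases == coverage."""
    return coverage * genome_bases / read_length


if __name__ == "__main__":
    # Coverage and read-length pairs taken from the text above.
    scenarios = [
        ("1000 Genomes Project", 7, 100),
        ("Illumina X-Ten", 30, 150),
        ("Illumina X-Ten (deep)", 50, 150),
    ]
    for label, coverage, read_length in scenarios:
        bases = sequenced_bases(coverage)
        reads = reads_needed(coverage, read_length)
        print(f"{label}: {bases:.2e} bases in {reads:.2e} reads")
```

Under these assumptions, the number of bases to store and process grows linearly with coverage, so a 30x genome carries roughly four times the raw sequence of a 7x genome before any quality scores or alignment metadata are counted.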
The age of the cohort has arrived
As we're now able to sequence genomes relatively cheaply and quickly, the time to unlock beautifully curated patient cohorts, often collected over many years, has arrived. As a member of the Kinghorn Centre for Clinical Genomics, I'm part of the team driving end-to-end cohort sequencing. This includes Whole Genome Sequencing in a clinically accredited facility, a best-practice analysis pipeline, the joint calling of variants across the cohort, and the loading of the genomic variants into a Genome Variant Store. Building a scale-out Variant Store that supports analytics against the cohorts is where we are going. We are also very enthusiastic about the NIH Data Commons vision, and ultimately seek to provide our cohorts "as a service" in a way that closely aligns with this vision.
My publications are available on PubMed:
Dr Warren Kaplan