Handling big data in genomics: Garvan partners with NCI
Dr Warren Kaplan (Garvan), Prof Lindsay Botten and Prof Chris Goodnow
Media Release: 25 February 2016
The Garvan Institute of Medical Research has become a collaboration partner of the National Computational Infrastructure (NCI), bringing together the southern hemisphere’s largest genome sequencing centre and its most powerful supercomputing environment for data-intensive research. Together, the two institutions will develop systems for the secure, centralised storage and analysis of genomic information in Australia.
The collaboration will mean that the large-scale genomic data generated at Garvan can be archived in a cost-effective and secure manner. In addition, collaborating research partners will be able to analyse Garvan’s genomic information in a secure environment by using the NCI’s supercomputer or high-performance cloud computing infrastructure.
The collaboration with Garvan marks a new direction for NCI, whose hosted datasets have until now focused on geological and meteorological data, climate science, and information from satellite imagery.
Professor Lindsay Botten, Director of NCI, says, “I am very excited about the collaboration with Garvan—one which sets an important new direction for NCI, and which provides an opportunity to bring the same transformational outcomes to genomics research that its ‘big data’ technologies have delivered to the environmental portfolio.
“NCI is strongly outcome-driven, and I am therefore delighted that we are partnering with Garvan to deliver an infrastructure platform that will be crucial for genomics research at the population scale.”
Dr Warren Kaplan, Chief of Informatics at Garvan’s Kinghorn Centre for Clinical Genomics, says that NCI provides an ideal environment to accommodate Garvan’s rapidly increasing computational and data storage needs.
“There are over 70 bioinformaticians working on genomic data at Garvan, and we are generating mind-bogglingly large amounts of genomic information. Until now, Garvan researchers have stored and analysed that information within our own High Performance Computing Infrastructure.
“However, as we scale to tens of thousands of genomes per year, it’s timely that we, in collaboration with NCI, switch to a new model of storing and analysing large-scale genomic data.
“We seek to make Garvan an attractive destination for the best genome scientists in the world. With our partnership with NCI and the dedicated high-speed link connecting the two sites, Garvan is well placed to retain its position amongst the finest genome-empowered medical research institutes.”
Dr Kaplan also explains how the collaboration will facilitate responsible data sharing between Garvan and other genomic researchers across Australia.
“Genomic datasets are now so large that it’s no longer feasible to be sharing data with others by copying it to different locations. Instead, a more workable approach is for the analysis to come to the data – and we see the NCI as the natural home for Australia’s genomic data.
“By storing genomic data at NCI, it will become easier for Garvan’s collaborators across the country to access data for research purposes, while maintaining strict rules of access that ensure data remains secure.”
Professor Chris Goodnow, Deputy Director of Garvan, sees the collaboration as a big step forward in how Australia manages genomic information.
He says, “Some things are just best handled at the national scale, and the secure storage and analysis of genomic information is one of those things.
“NCI provides an academically accessible but secure computational environment, so it’s an ideal repository for the large-scale genomic datasets that Garvan is producing.
“This is not just about Garvan and NCI – this is doing something good for all Australia.”
As Australia’s national, high-performance research computing facility, NCI manages the southern hemisphere’s most integrated supercomputer and filesystems, delivering high-quality computational and data services to researchers in three national science agencies and nearly 30 of Australia’s universities.
NCI is home to one of Australia’s largest data catalogues, hosting over 10 petabytes (10 billion megabytes) of nationally and internationally significant research data. Its Raijin supercomputer has a peak performance of 1.2 petaflops, enabling Australian researchers to work with their data in ways that would not otherwise be possible.
Garvan is one of Australia’s leading medical research institutions, and is at the forefront of next-generation genomic sequencing in Australia. In 2014, Garvan acquired the HiSeq X Ten sequencing platform, making it possible to sequence 18,000 whole human genomes per year. At full capacity, Garvan generates approximately 1800 terabytes (1.8 billion megabytes) of archivable data annually.
Notes to Editors
What is a genome? A genome is the entire DNA sequence of an individual. A human genome is approximately 6 billion base pairs, or letters of DNA code.
Storing genomic data: A single human genome requires at least 200 gigabytes of storage.
Analysing genomes: The analysis of a single genome requires 700 CPU core hours on the NCI’s supercomputer. In practice, using the supercomputer will enable Garvan’s researchers to process hundreds of genomes simultaneously, over a several-day period.
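As a rough illustration of what the 700 core-hour figure implies at scale, the short Python sketch below multiplies it out. The per-genome figure and the 18,000 genomes per year come from this release. The batch size of 300 genomes and the 4,000 available cores are hypothetical values chosen only for the example, and the calculation assumes the work spreads evenly across cores with no queueing or I/O delays.

```python
# Back-of-envelope compute estimate; not a model of NCI's actual scheduler.

CORE_HOURS_PER_GENOME = 700   # stated in this release
GENOMES_PER_YEAR = 18_000     # stated sequencing capacity in this release


def core_hours(genomes: int, per_genome: int = CORE_HOURS_PER_GENOME) -> int:
    """Total CPU core hours needed to analyse the given number of genomes."""
    return genomes * per_genome


def wall_clock_hours(genomes: int, cores_available: int) -> float:
    """Elapsed hours if the work divides evenly across the available cores.

    Assumes perfect parallelism, which real workloads will not achieve.
    """
    return core_hours(genomes) / cores_available


# Annual demand at full sequencing capacity.
annual_core_hours = core_hours(GENOMES_PER_YEAR)

# Hypothetical batch: 300 genomes on 4,000 cores (both numbers are assumptions).
batch_hours = wall_clock_hours(genomes=300, cores_available=4_000)
batch_days = batch_hours / 24
```

Under these assumptions, full capacity corresponds to 12.6 million core hours per year, and the hypothetical batch of 300 genomes would take about 52.5 hours, or a little over two days, which is in line with the several-day period described above.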
High-speed data transfer between Garvan and NCI: Data is transferred between Garvan and NCI through a dedicated high-speed link that is provided by AARNet. Currently, the link operates at 4 gigabits per second, making it possible to move an entire genome sequence in 3-4 minutes.
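The transfer time can be estimated from the link speed and the size of the data being moved. The Python sketch below is an idealised calculation that uses decimal units (1 gigabyte = 8 gigabits) and ignores protocol overhead and link contention. The release does not state which file size the 3-4 minute figure is based on, so the sketch tries two sizes: the 200 gigabytes quoted for storing a genome, and the roughly 100 gigabytes per genome implied by 1800 terabytes of archivable data across 18,000 genomes per year.

```python
# Idealised transfer-time estimate over a fixed-rate link.

LINK_GIGABITS_PER_SECOND = 4.0   # stated link speed in this release


def transfer_minutes(size_gigabytes: float,
                     link_gbps: float = LINK_GIGABITS_PER_SECOND) -> float:
    """Minutes to move size_gigabytes at link_gbps, with no overhead."""
    gigabits = size_gigabytes * 8          # 8 bits per byte
    seconds = gigabits / link_gbps
    return seconds / 60


# Archivable data per genome implied by the release's annual figures:
# 1800 TB per year across 18,000 genomes, using 1 TB = 1000 GB.
archive_gb_per_genome = 1800 * 1000 / 18_000

minutes_for_archive = transfer_minutes(archive_gb_per_genome)
minutes_for_200_gb = transfer_minutes(200)
```

On these assumptions, about 100 gigabytes moves in roughly 3.3 minutes, which falls within the 3-4 minute range quoted above, while a full 200 gigabytes would take closer to 6.7 minutes.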
NCI funding and collaboration partners: NCI is supported by the Australian Government’s National Collaborative Research Infrastructure Strategy (NCRIS), with operational funding provided through a formal collaboration incorporating science agencies CSIRO, Bureau of Meteorology and Geoscience Australia with The Australian National University, Intersect, QCIF, Deakin University, ACE CRC, Garvan, the Australian Research Council, and a number of research-intensive universities.
Garvan Institute of Medical Research:
Dr Meredith Ross
Science Media & Communications Coordinator
T: +61 (0) 2 9295 8128
M: +61 (0) 439 873 258
National Computational Infrastructure:
M: +61 429 193 181