Phenomics Team

Team members

Dr Tudor Groza, Dr Simon Kocbek, Ahmed Muaz, Dr Frank Lin




Dr Tudor Groza


Tel: +61 (0)2 9355 5717

Mission statement

The KCCG Phenomics program aims to provide comprehensive solutions to enrich the understanding of the associations occurring between diseases, genotype, phenotype and environment via structured knowledge representation and discovery techniques.


Phenomics is the study of the collective physical and biochemical characteristics of an organism. Phenomics plays a crucial role in enabling genomic data to be interpreted and generating disease diagnoses.

For genomic research to advance and improve patient care, tight integration of phenomics and genomics is required. This will enable:

  • Interpretation and prioritization of the millions of variants present in each patient
  • Comprehensive linkage of detailed phenotypic terms to genomic variants and diseases

Combining detailed phenotype profiles with clinical genomics data will drive progress in our understanding of the human genome and enable effective integration of genomics into the clinic, to support faster and more accurate diagnosis of rare and complex conditions.

The KCCG Phenomics Program applies natural language processing and ontology-driven techniques to recognise phenotypic information in electronic medical records (EMR), patient notes, case reports, and scientific and medical literature. This information is converted into machine-readable terminology, such as that of the Human Phenotype Ontology (HPO). The result is an automated process for converting large volumes of unstructured text into new knowledge, together with phenotype analytics and visualisation tools for patients, clinicians and researchers.

Core activities

Machine Readable Knowledge Representation

Computational phenotyping can only be achieved if the knowledge about diseases, medically relevant phenotypes and medically relevant risk factors is expressed in a machine-readable and machine-interpretable representation. The team contributes to the creation and enrichment of ontologies modelling these domains as part of the global Monarch Initiative.
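As an illustration of what a machine-readable representation can look like, the sketch below models a small, simplified fragment of a phenotype ontology as terms linked by is_a relations and computes the ancestors of a term. The `Term` class, the `ONTOLOGY` dictionary and the `ancestors` function are hypothetical names introduced for this example, and the identifiers and labels should be checked against the current HPO release before any real use.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One ontology concept with its is_a parents (hypothetical structure)."""
    term_id: str
    label: str
    parents: list = field(default_factory=list)


# Simplified, hand-written fragment for illustration only.
ONTOLOGY = {
    "HP:0000118": Term("HP:0000118", "Phenotypic abnormality"),
    "HP:0000707": Term("HP:0000707", "Abnormality of the nervous system",
                       ["HP:0000118"]),
    "HP:0012638": Term("HP:0012638", "Abnormal nervous system physiology",
                       ["HP:0000707"]),
    "HP:0001250": Term("HP:0001250", "Seizure", ["HP:0012638"]),
}


def ancestors(term_id, ontology):
    """Return the set of all terms reachable from term_id via is_a links."""
    seen = set()
    stack = list(ontology[term_id].parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(ontology[current].parents)
    return seen


if __name__ == "__main__":
    for ancestor_id in sorted(ancestors("HP:0001250", ONTOLOGY)):
        print(ancestor_id, ONTOLOGY[ancestor_id].label)
```

Because the hierarchy is explicit, software can infer that a patient annotated with a specific term also exhibits each of its more general ancestor terms.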

Automated Knowledge Acquisition and Analytic Tools

Information Extraction

Most of today’s clinical data is stored in the form of free text notes or observations. We need to bridge the gap between this unstructured representation of the data and the machine-readable representation of the corresponding knowledge. The Phenomics team devises Natural Language Processing and Machine Learning mechanisms to extract meaningful concepts from free text data, to enable computational phenotyping and phenotype analytics.
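The simplest form of this kind of extraction is a dictionary lookup, sketched below. This is not the team's method, which relies on Natural Language Processing and Machine Learning; it is a minimal sketch assuming a tiny hand-written lexicon. The `LEXICON` mapping and the `annotate` function are hypothetical, and the HPO identifiers are included for illustration only.

```python
import re

# Tiny hand-written lexicon (hypothetical). A real system would load the
# full ontology, including its synonyms.
LEXICON = {
    "seizure": "HP:0001250",
    "seizures": "HP:0001250",
    "microcephaly": "HP:0000252",
    "small head": "HP:0000252",
    "intellectual disability": "HP:0001249",
}


def annotate(text):
    """Return (start, end, matched_text, concept_id) tuples in text order."""
    hits = []
    # Try longer phrases first so they take priority over shorter overlaps.
    for phrase in sorted(LEXICON, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start, end = match.start(), match.end()
            # Skip a match that overlaps a span we have already accepted.
            if any(s < end and start < e for s, e, _, _ in hits):
                continue
            hits.append((start, end, text[start:end], LEXICON[phrase]))
    return sorted(hits)


if __name__ == "__main__":
    note = "Patient presents with seizures and a small head."
    for hit in annotate(note):
        print(hit)
```

A lookup like this misses misspellings, negation ("no seizures") and unusual word order, which is why statistical and linguistic methods are needed in practice.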

Phenotype analytics

Information extraction provides the means to create a channel between the now structured clinical data and the existing body of bio-medical knowledge. This, subsequently, supports various analytical tasks. The Phenomics team focuses on decision support methods to aid diagnosis, and hence uses phenotype-driven approaches to explore candidate disorders or to prioritise gene interpretation.
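To make the idea of phenotype-driven exploration concrete, the sketch below ranks candidate disorders by how well their phenotype profiles overlap with a patient's terms. It uses plain Jaccard similarity as a stand-in; production tools typically use ontology-aware semantic similarity instead, and nothing here should be read as the team's actual algorithm. The disorder names and profiles are invented for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of term identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_disorders(patient_terms, disorder_profiles):
    """Return (disorder, score) pairs, best match first, ties by name."""
    scored = [
        (name, jaccard(patient_terms, terms))
        for name, terms in disorder_profiles.items()
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Hypothetical profiles, not taken from any real disease annotation.
    profiles = {
        "disorder_A": {"HP:0001250", "HP:0000252"},
        "disorder_B": {"HP:0001249"},
        "disorder_C": {"HP:0001250", "HP:0001249", "HP:0001263"},
    }
    patient = {"HP:0001250", "HP:0000252"}
    for name, score in rank_disorders(patient, profiles):
        print(f"{name}: {score:.2f}")
```

The same ranking idea carries over to gene prioritisation when each gene is associated with a phenotype profile.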

Visualisation tools

The analytical methods developed by the Phenomics team aim to support the decision-making process by providing the clinician with exploratory options. These options only become efficient and useful if they are presented in an intuitive and easy-to-use manner. The team therefore considers visualisation tools that represent complex knowledge in a user-friendly way to be a critical component of the overall vision.

Product development and clinical applications

The software developed by the Phenomics Team accelerates translational and clinical applications of genomic technologies through harmonising phenomic information and the intelligent distillation of its informative content. It also enables phenotypic analyses to provide a translational bridge from genome-scale biology to a patient-centered view on human disease pathogenesis.

Patient Archive

A clinical grade phenotype-oriented patient data management platform combining the richness of the Human Phenotype Ontology with highly intuitive user interfaces to aid the discovery and decision-making process in the context of clinical genomics. This platform enables deep computational phenotyping and collaboration by local and global patient data sharing (via the MatchMaker Exchange Initiative).



Patient Archive is the only platform that enables clinicians to use free text clinical notes for structured patient phenotyping, store the data in a secure manner (sensitive patient data is encrypted) and share the data via a fine-grained access control model. Furthermore, the platform provides support for intelligent analytics, focused on disease exploration, patient match-making and prescriptive phenotyping. A demo version of the latest release is available online. Get in touch with us for a demo of the latest development version.

High-precision phenotype concept recognition

A unique solution for high-precision phenotype concept recognition, which enables fast and accurate mapping of free text clinical data to Human Phenotype Ontology concepts. It is currently used in various applications, including Patient Archive, MyGene2 and the Monarch Initiative.


The underlying technique has been built to cater for the high lexical variability associated with clinical phenotypes, as well as for decomposing coordinated terms and detecting non-canonical phenotypes. The concept recognizer is part of the Patient Archive and serves other platforms, such as University of Washington’s MyGene2 and Baylor’s OMIM Explorer. It has also been used to generate the first HPO annotation dataset for common diseases, available via our PubMed Browser. The HPO CR package is freely available for academic use on request.
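Decomposing coordinated terms means turning a phrase such as "atrial and ventricular septal defects" into the separate phenotypes it stands for. The sketch below shows one naive heuristic for this, assuming that every conjunct is a single modifier sharing the head that follows the last modifier. It is a hypothetical illustration, not the approach implemented in the HPO CR package, and `expand_coordination` is a name introduced for this example.

```python
import re

# Split on commas (optionally followed by and/or) or on a bare and/or.
_SEPARATOR = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def expand_coordination(phrase):
    """Expand 'A and B HEAD' into ['A HEAD', 'B HEAD'].

    Naive assumption: the last conjunct is one modifier word followed by
    the shared head, and every earlier conjunct is a modifier only.
    """
    phrase = phrase.strip()
    parts = [part for part in _SEPARATOR.split(phrase) if part]
    if len(parts) < 2:
        return [phrase]
    last_words = parts[-1].split()
    if len(last_words) < 2:
        # No shared head to distribute, so leave the phrase untouched.
        return [phrase]
    head = " ".join(last_words[1:])
    modifiers = parts[:-1] + [last_words[0]]
    return [f"{modifier} {head}" for modifier in modifiers]


if __name__ == "__main__":
    print(expand_coordination("atrial and ventricular septal defects"))
    print(expand_coordination("short, broad and curved fingers"))
```

Each expanded phrase can then be matched against the ontology on its own. Real clinical text breaks the single-modifier assumption often, which is one reason a dedicated concept recognizer is needed.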

Coming soon: Journal manuscript annotator – for creating structured pheno-packets

Community-driven knowledge curation

An innovative platform for curating domain knowledge, focused on intuitive and friendly user interfaces and workflow-based knowledge curation and acquisition. This area of development aims to improve current knowledge curation workflows in the context of rare disease nomenclatures. An instance of the platform is currently used by the Orphanet consortium to support the editorial process and curate the content of the Orphanet terminology.


The system will soon be available for public use in the context of the Orphanet data curation initiative.

Selected publications


Collaborators

JAX Labs, US
Peter Robinson

Oregon Health & Science University, US
Melissa Haendel

Sanford Health, US
Cornelius Boerkoel

Berkeley Labs, US
Chris Mungall

Genomics England, UK
Damian Smedley

Orphanet, France
Ana Rath

Charité Medical University Hospital, Berlin, Germany
Sebastian Koehler

Keio University, Japan
Kenjiro Kosaki

Database Centre for Life Science (DBCLS), Japan
Jin-Dong Kim

Office of Population Health Genomics (OPHG) Perth, Western Australia
Hugh Dawkins

Genetic Services Western Australia
Gareth Baynam

CSIRO, Australia
David Hansen

Sick Kids Hospital, Toronto, Canada
Orion Buske