KCCG is using intelligent systems to supercharge searching the vast scientific literature.

Natural language processing set to supercharge research

PubMed is used by millions of users a month to search the scientific literature. Yet the platform has some limitations, such as the difficulty of searching for a gene that has multiple names or is implicated in multiple disorders.

Ahmed Muaz, Principal Software Engineer in the Phenomics Program, is building an app in partnership with the Monarch Initiative to make it easier for clinicians and researchers to find what they need in the millions of articles in PubMed. What sets it apart is its use of natural language processing (NLP), a form of artificial intelligence.

Every day, the platform takes in any new articles, then uses NLP to recognise biomedical terms like genes and phenotypes.

“The platform works as a continuous processing pipeline, starting with cleaning the article text, identifying linguistic structures like sentences and tagging different parts of speech like the subject and object,” said Mr Muaz.

“This structured data is then passed to individual annotation engines like Gene Marker and Phenotype Concept Recognizer and tagged with biomedical terms.”
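
As a rough illustration of the pipeline Mr Muaz describes, the sketch below cleans an abstract, splits it into sentences and hands those sentences to two separate annotators. It is a toy under stated assumptions, not the platform's code: the lexicons, the `Annotation` type and every function name are hypothetical, simple dictionary lookup stands in for the real Gene Marker and Phenotype Concept Recognizer engines, and the part-of-speech tagging step is left out.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicons (assumption); real engines use far richer resources.
GENE_LEXICON = {"BRCA1", "BRCA2"}
PHENOTYPE_LEXICON = {"ovarian disease", "breast carcinoma"}


@dataclass(frozen=True)
class Annotation:
    kind: str      # "gene" or "phenotype"
    term: str      # the lexicon entry that matched
    sentence: int  # index of the sentence containing the match


def clean(text: str) -> str:
    """Replace markup-like tags with spaces and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def annotate_genes(sentences: list[str]) -> list[Annotation]:
    """Tag alphanumeric tokens that exactly match a gene symbol."""
    found = []
    for i, sentence in enumerate(sentences):
        for token in re.findall(r"[A-Za-z0-9]+", sentence):
            if token in GENE_LEXICON:
                found.append(Annotation("gene", token, i))
    return found


def annotate_phenotypes(sentences: list[str]) -> list[Annotation]:
    """Tag case-insensitive substring matches against phenotype labels."""
    found = []
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        for term in sorted(PHENOTYPE_LEXICON):
            if term in lowered:
                found.append(Annotation("phenotype", term, i))
    return found


def run_pipeline(raw: str) -> list[Annotation]:
    """Clean, split into sentences, then run each annotator in turn."""
    sentences = split_sentences(clean(raw))
    return annotate_genes(sentences) + annotate_phenotypes(sentences)
```

Calling `run_pipeline` on a short abstract returns a flat list of gene and phenotype annotations, each recording the sentence it came from. Keeping the annotators as separate functions mirrors the idea of independent engines that all consume the same structured input.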

These terms power the search, which runs through a user-friendly interface developed by Ali Azarkish. When users search by genes and phenotypes, their search terms are highlighted in the abstracts of the resulting articles.

More powerfully, any other genes and phenotypes in the abstracts are also highlighted and the platform lets users filter the results by these associated terms. Some of these associations may be predictable, such as a search for “ovarian disease” and “BRCA2” bringing up “BRCA1”. Others may be more novel.
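
The filtering idea can be sketched in a few lines. This is only an illustration under assumptions: the article IDs, the `ARTICLES` data and the function names are invented for the example, and the platform's actual index and ranking are not described in this article.

```python
from collections import Counter

# Hypothetical annotated abstracts: article ID -> set of recognised terms.
ARTICLES = {
    "a1": {"BRCA2", "ovarian disease", "BRCA1"},
    "a2": {"BRCA2", "ovarian disease"},
    "a3": {"BRCA1", "breast carcinoma"},
}


def search(articles, query_terms):
    """Return the articles whose annotations contain every query term."""
    query = set(query_terms)
    return {aid: terms for aid, terms in articles.items() if query <= terms}


def associated_terms(hits, query_terms):
    """Count the other terms that co-occur with the query in the hits."""
    query = set(query_terms)
    counts = Counter()
    for terms in hits.values():
        counts.update(terms - query)
    return counts


def filter_by(hits, term):
    """Narrow the hits to articles that also mention the chosen term."""
    return {aid: terms for aid, terms in hits.items() if term in terms}
```

With this toy data, searching for "BRCA2" and "ovarian disease" matches two articles, surfaces "BRCA1" as an associated term, and filtering by "BRCA1" narrows the results to the one article that mentions all three.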

“This tool boosts clinicians’ and researchers’ understanding of genomic disease, and even their ability to make diagnoses, by showing how genotypes and phenotypes are connected in ways that they may not have expected,” said Dr Tudor Groza, Head of the Phenomics Program.

At a rate of 30 articles per second, the platform has already processed approximately 27 million articles from PubMed to recognise genes and phenotypes within them. It has done further annotation on 17.3 million of these abstracts.
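
To put those figures in perspective, a quick back-of-the-envelope calculation shows how long 27 million articles would take at 30 per second. It assumes a constant rate and uninterrupted running, which the article does not state; the names below are illustrative only.

```python
ARTICLES_PROCESSED = 27_000_000  # approximate figure quoted above
ARTICLES_PER_SECOND = 30         # processing rate quoted above


def processing_time(articles, rate_per_second):
    """Return the elapsed time as (seconds, hours, days) at a constant rate."""
    seconds = articles / rate_per_second
    hours = seconds / 3600
    days = hours / 24
    return seconds, hours, days


seconds, hours, days = processing_time(ARTICLES_PROCESSED, ARTICLES_PER_SECOND)
print(f"{seconds:,.0f} seconds, {hours:,.0f} hours, {days:.1f} days")
```

Under those assumptions the whole collection amounts to 900,000 seconds of processing, which is 250 hours or a little over ten days.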

Multiple research groups internationally are interested in using the platform. The Phenomics team are working on training the program to also recognise diseases, genetic variants, symptoms, therapies and the relationships between these terms, to allow for even more powerful searching.