Accelerating Candidate Gene Discovery through Ontological Indexing of Large-Scale Data Repositories

“Are any of these genes associated with my disease or phenotype? Is this candidate gene expressed in my tissue of interest?” Scientists trying to identify genes that contribute to human disease ask questions like these virtually every day. Model organism databases such as the Rat Genome Database (RGD) curate published data related to these questions, but far more information is available than can be manually curated. Much of this information is being deposited into large-scale data repositories, but extracting usable information and knowledge from the stored data is a challenging problem. The goal of our project is twofold: 1) to explore the use of ontologies and the NCBO's Web service technologies to annotate large-scale repositories such as NCBI's Gene Expression Omnibus (GEO), and 2) to build tools that enable researchers to use the resulting annotations to further their studies of the genetic causes of disease. I will present our work to date annotating over 1.5 billion data points in the GEO database and our progress towards making these annotations available to the community, both as RDF and ultimately on the RGD website.