

To see a list of journal publications and conference proceedings which have referenced NCBO in a meaningful way click here



February 14, 2014

BACKGROUND: Gene Ontology (GO) enrichment analysis remains one of the most common methods for hypothesis generation from high throughput datasets. However, we believe that researchers strive to test other hypotheses that fall outside of GO. Here, we developed and evaluated a tool for hypothesis generation from gene or protein lists using ontological concepts present in manually curated text that describes those genes and proteins.

RESULTS: As a consequence we have developed the method Statistical Tracking of Ontological Phrases (STOP) that expands the realm of testable hypotheses in gene set enrichment analyses by integrating automated annotations of genes to terms from over 200 biomedical ontologies. While not as precise as manually curated terms, we find that the additional enriched concepts have value when coupled with traditional enrichment analyses using curated terms.

CONCLUSION: Multiple ontologies have been developed for gene and protein annotation. By using a dataset of both manually curated GO terms and automatically recognized concepts from curated text, we can expand the realm of hypotheses that can be discovered. The web application STOP is available at 

December 3, 2013

BACKGROUND: Juvenile idiopathic arthritis is the most common rheumatic disease in children. Chronic uveitis is a common and serious comorbid condition of juvenile idiopathic arthritis, with insidious presentation and potential to cause blindness. Knowledge of clinical associations will improve risk stratification. Based on clinical observation, we hypothesized that allergic conditions are associated with chronic uveitis in juvenile idiopathic arthritis patients.

METHODS: This study is a retrospective cohort study using Stanford's clinical data warehouse, which contains data from Lucile Packard Children's Hospital from 2000-2011, to analyze patient characteristics associated with chronic uveitis in a large juvenile idiopathic arthritis cohort. Clinical notes for patients under 16 years of age were processed via a validated text analytics pipeline. Bivariate-associated variables were used in a multivariate logistic regression adjusted for age, gender, and race. Previously reported associations were evaluated to validate our methods. The main outcome measure was the presence of terms indicating allergy or allergy medication use that were overrepresented in juvenile idiopathic arthritis patients with chronic uveitis. Residual text features were then used in unsupervised hierarchical clustering to compare clinical text similarity between patients with and without uveitis.

RESULTS: Previously reported associations with uveitis in juvenile idiopathic arthritis patients (earlier age at arthritis diagnosis, oligoarticular-onset disease, antinuclear antibody status, history of psoriasis) were reproduced in our study. Use of allergy medications and terms describing allergic conditions were independently associated with chronic uveitis. The association with allergy drugs when adjusted for known associations remained significant (OR 2.54, 95% CI 1.22-5.4).

CONCLUSIONS: This study shows the potential of using a validated text analytics pipeline on clinical data warehouses to examine practice-based evidence for evaluating hypotheses formed during patient care. Our study reproduces four known associations with uveitis development in juvenile idiopathic arthritis patients, and reports a new association between allergic conditions and chronic uveitis in juvenile idiopathic arthritis patients.

March 4, 2013

With increasing adoption of electronic health records (EHRs), there is an opportunity to use the free-text portion of EHRs for pharmacovigilance. We present novel methods that annotate the unstructured clinical notes and transform them into a deidentified patient-feature matrix encoded using medical terminologies. We demonstrate the use of the resulting high-throughput data for detecting drug-adverse event associations and adverse events associated with drug-drug interactions. We show that these methods flag adverse events early (in most cases before an official alert), allow filtering of spurious signals by adjusting for potential confounding, and compile prevalence information. We argue that analyzing large volumes of free-text clinical notes enables drug safety surveillance using a yet untapped data source. Such data mining can be used for hypothesis generation and for rapid analysis of suspected adverse event risk.

March 1, 2013

BioPortal is a repository of biomedical ontologies--the largest such repository, with more than 300 ontologies to date. This set includes ontologies that were developed in OWL, OBO and other formats, as well as a large number of medical terminologies that the US National Library of Medicine distributes in its own proprietary format. We have published the RDF version of all these ontologies at This dataset contains 190M triples, representing both metadata and content for the 300 ontologies. We use the metadata that the ontology authors provide and simple RDFS reasoning in order to provide dataset users with uniform access to key properties of the ontologies, such as lexical properties for the class names and provenance data. The dataset also contains 9.8M cross-ontology mappings of different types, generated both manually and automatically, which come with their own metadata.
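
The role of simple RDFS reasoning in giving uniform access to label properties can be illustrated with a small sketch. This is not the BioPortal implementation: the class and property identifiers, and the choice of `skos:prefLabel` as the common target property, are assumptions made purely for illustration. The only rule applied is the standard RDFS subproperty entailment: if `(s p o)` holds and `p rdfs:subPropertyOf q`, then `(s q o)` holds.

```python
# Hypothetical triples: two ontologies that name their classes with
# different label properties (all identifiers here are made up).
PREF_LABEL = "skos:prefLabel"
SUB_PROPERTY_OF = "rdfs:subPropertyOf"

triples = {
    ("ex:Melanoma", "obo:preferred_name", "Melanoma"),
    ("ex:Aspirin", "rdfs:label", "Aspirin"),
    # Metadata of the kind an ontology author might supply:
    ("obo:preferred_name", SUB_PROPERTY_OF, PREF_LABEL),
    ("rdfs:label", SUB_PROPERTY_OF, PREF_LABEL),
}


def rdfs_subproperty_closure(triples):
    """Apply (s p o), (p subPropertyOf q) => (s q o) until a fixed point."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        sub_props = {(s, o) for s, p, o in inferred if p == SUB_PROPERTY_OF}
        for s, p, o in list(inferred):
            for child, parent in sub_props:
                if p == child and (s, parent, o) not in inferred:
                    inferred.add((s, parent, o))
                    changed = True
    return inferred


def pref_labels(triples):
    """Return {class: label}, querying only the common property."""
    closure = rdfs_subproperty_closure(triples)
    return {s: o for s, p, o in closure if p == PREF_LABEL}


labels = pref_labels(triples)
```

A dataset user can then ask for one property and receive labels from both ontologies, without knowing which property each author originally used.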

December 1, 2011

Advanced statistical methods used to analyze high-throughput data such as gene-expression assays result in long lists of "significant genes." One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene set, and is widely used to make sense of the results of high-throughput experiments. Our goal is to develop and apply general enrichment analysis methods to profile other sets of interest, such as patient cohorts from the electronic medical record, using a variety of ontologies including SNOMED CT, MedDRA, RxNorm, and others. Although it is possible to perform enrichment analysis using ontologies other than the GO, a key pre-requisite is the availability of a background set of annotations to enable the enrichment calculation. In the case of the GO, this background set is provided by the Gene Ontology Annotations. In the current work, we describe: (i) a general method that uses hand-curated GO annotations as a starting point for creating background datasets for enrichment analysis using other ontologies; and (ii) a gene-disease background annotation set - that enables disease-based enrichment - to demonstrate feasibility of our method.
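
To make concrete why a background annotation set is a prerequisite, the following is a minimal sketch of an enrichment calculation. It assumes a one-sided hypergeometric test, which is a common choice for over-representation but is not stated in the abstract; the function names and the toy annotations are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` items without replacement from
    `population` items, of which `annotated` carry the term."""
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


def enriched_terms(background, study_set):
    """background: {item: set of terms}. Returns {term: p-value} for every
    term that annotates at least one item of the study set."""
    population = len(background)
    study = [item for item in study_set if item in background]
    terms = set()
    for item in study:
        terms |= background[item]
    results = {}
    for term in terms:
        annotated = sum(1 for ts in background.values() if term in ts)
        hits = sum(1 for item in study if term in background[item])
        results[term] = hypergeom_enrichment_p(
            population, annotated, len(study), hits
        )
    return results


# Toy background set (hypothetical items and terms).
background = {
    "g1": {"T1"}, "g2": {"T1"}, "g3": {"T2"},
    "g4": {"T2"}, "g5": set(), "g6": {"T2"},
}
p_values = enriched_terms(background, {"g1", "g2"})
```

Without the `background` mapping there is no way to compute how often a term would be expected in a random set of the same size, which is the point the abstract makes about ontologies other than the GO.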

September 1, 2011

The volume of publicly available data in biomedicine is constantly increasing. However, these data are stored in different formats and on different platforms. Integrating these data will enable us to accelerate the pace of medical discoveries by providing scientists with a unified view of this diverse information. Under the auspices of the National Center for Biomedical Ontology (NCBO), we have developed the Resource Index, a growing, large-scale ontology-based index of more than twenty heterogeneous biomedical resources. The resources come from a variety of repositories maintained by organizations from around the world. We use a set of over 200 publicly available ontologies contributed by researchers in various domains to annotate the elements in these resources. We use the semantics that the ontologies encode, such as different properties of classes, the class hierarchies, and the mappings between ontologies, in order to improve the search experience for the Resource Index user. Our user interface enables scientists to search the multiple resources quickly and efficiently using domain terms, without even being aware that there is semantics "under the hood."

August 25, 2011
Mark A. Musen,1 Natasha F. Noy,1 Nigam H. Shah,1 Christopher G. Chute,2 Margaret-Anne Storey,3 Barry Smith,4 and the NCBO team
1Center for Biomedical Informatics Research, Stanford University, Stanford, CA USA
2Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN USA
3Department of Computer Science, University of Victoria, Victoria, BC Canada
4Department of Philosophy, University at Buffalo, Buffalo, NY USA
The National Center for Biomedical Ontology (NCBO) is now in its seventh year. The goals of this National Center for Biomedical Computing are to create and maintain a repository of biomedical ontologies and terminologies; to build tools and Web services to enable the use of ontologies and terminologies in clinical and translational research; to educate our trainees and the scientific community broadly about biomedical ontology and ontology-based technology and best practices; and to collaborate with a variety of groups who develop and use ontologies and terminologies in biomedicine. The centerpiece of the NCBO is a Web-based resource known as BioPortal. BioPortal makes available for research in computationally useful forms more than 270 of the world’s biomedical ontologies and terminologies, and supports a wide range of Web services that enable investigators to use the ontologies to annotate and retrieve data, to generate value sets and special-purpose lexicons, and to perform advanced analytics on a wide range of biomedical data.


October 22, 2010
Smith B, Goldberg LJ, Ruttenberg A, Glick M
J Am Dent Assoc. 2010: 141 (10):1173-5
How do we find what is clinically significant in the swarms of data being generated by today’s diagnostic technologies? As electronic records become ever more prevalent (and digital imaging and genomic, proteomic, salivaomics, metabolomics, pharmacogenomics, phenomics and transcriptomics techniques become commonplace), different clinical and biological disciplines are facing up to the need to put their data houses in order to avoid the consequences of an uncontrolled explosion of different ways of describing information. Fortunately, a new strategy to advance the consistency of data in the dental research community is emerging. The strategy is based on the idea that existing systems for data collection in dental research will continue to be used, but proposes a methodology in which past, present and future data will be described using a consensus-based controlled structured vocabulary called the Ontology for Dental Research (ODR).
October 22, 2010
Tenenbaum JD, Whetzel PL, Anderson K, Borromeo CD, Dinov ID, Gabriel D, Kirschner B, Mirel B, Morris T, Noy N, Nyulas C, Rubenson D, Saxman PR, Singh H, Whelan N, Wright Z, Athey BD, Becich MJ, Ginsburg GS, Musen MA, Smith KA, Tarantal AF, Rubin DL, Lyster P
J Biomed Inform. (2010) doi:10.1016/j.jbi.2010.10.
The biomedical research community relies on a diverse set of resources, both within their own institutions and at other research centers. In addition, an increasing number of shared electronic resources have been developed. Without effective means to locate and query these resources, it is challenging, if not impossible, for investigators to be aware of the myriad resources available, or to effectively perform resource discovery when the need arises. In this paper, we describe the development and use of the Biomedical Resource Ontology (BRO) to enable semantic annotation and discovery of biomedical resources. We also describe the Resource Discovery System (RDS) which is a federated, inter-institutional pilot project that uses the BRO to facilitate resource discovery on the Internet. Through the RDS framework and its associated Biositemaps infrastructure, the BRO facilitates semantic search and discovery of biomedical resources, breaking down barriers and streamlining scientific research that will improve human health.
July 20, 2010
Roeder C, Jonquet C, Shah NH, Baumgartner WA Jr, Verspoor K, Hunter L (2010)
Bioinformatics (26):1800-1801 doi:10.1093/bioinformatics/btq250
The Unstructured Information Management Architecture (UIMA) framework and web services are emerging as useful tools for integrating biomedical text mining tools. This note describes our work, which wraps the National Center for Biomedical Ontology (NCBO) Annotator - an ontology-based annotation service - to make it available as a component in UIMA workflows.
The wrapper is freely available on the web at as part of the UIMA tools distribution from the Center for Computational Pharmacology (CCP) at the University of Colorado School of Medicine. It has been implemented in Java for support on Mac OS X, Linux and MS Windows.
July 12, 2010
Jonquet C, Musen MA, Shah NH (2010)
Journal of Biomedical Semantics (S1) doi:10.1186/2041-1480-1-S1-S1
Researchers in biomedical informatics use ontologies and terminologies to annotate their data in order to facilitate data integration and translational discoveries. As the use of ontologies for annotation of biomedical datasets has risen, a common challenge is to identify ontologies that are best suited to annotating specific datasets. The number and variety of biomedical ontologies is large, and it is cumbersome for a researcher to figure out which ontology to use.
We present the Biomedical Ontology Recommender web service. The system uses textual metadata or a set of keywords describing a domain of interest and suggests appropriate ontologies for annotating or representing the data. The service makes a decision based on three criteria. The first one is coverage, or the ontologies that provide most terms covering the input text. The second is connectivity, or the ontologies that are most often mapped to by other ontologies. The final criterion is size, or the number of concepts in the ontologies. The service scores the ontologies as a function of scores of the annotations created using the National Center for Biomedical Ontology (NCBO) Annotator web service. We used all the ontologies from the UMLS Metathesaurus and the NCBO BioPortal.
We compare our Recommender with previously published efforts through an exhaustive functional comparison. We evaluate and discuss the results of several recommendation heuristics in the context of three real-world use cases. The best recommendation heuristics, rated ‘very relevant’ by expert evaluators, are the ones based on the coverage and connectivity criteria. The Recommender service (alpha version) is available to the community and is embedded into BioPortal.
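The three criteria named in the abstract (coverage, connectivity, size) can be sketched as follows. This is an illustrative toy, not the Recommender's actual scoring function: the `Ontology` class, the exact-match notion of coverage, and the rule of ranking by coverage with connectivity as a tie-breaker are all assumptions, and the real service derives its scores from NCBO Annotator annotations rather than from keyword lookup.

```python
from dataclasses import dataclass


@dataclass
class Ontology:
    name: str
    terms: set             # lower-cased concept names (hypothetical)
    inbound_mappings: int  # mappings from other ontologies pointing here


def criteria(ontology, keywords):
    """Raw values for the three criteria named in the abstract."""
    words = {k.lower() for k in keywords}
    return {
        "coverage": len(words & ontology.terms),
        "connectivity": ontology.inbound_mappings,
        "size": len(ontology.terms),
    }


def recommend(ontologies, keywords):
    """Rank by coverage, breaking ties by connectivity (an arbitrary choice)."""
    scored = [(o.name, criteria(o, keywords)) for o in ontologies]
    return sorted(
        scored,
        key=lambda item: (item[1]["coverage"], item[1]["connectivity"]),
        reverse=True,
    )


# Toy candidates (made-up names, terms, and mapping counts).
candidates = [
    Ontology("C", {"heart"}, 100),
    Ontology("B", {"melanoma", "aspirin"}, 50),
    Ontology("A", {"melanoma", "skin", "nevus"}, 10),
]
ranking = [name for name, _ in recommend(candidates, ["Melanoma", "skin"])]
```

In this toy ranking the highly connected ontology "C" comes last because it covers none of the input keywords, which shows why a heuristic has to decide how the criteria are combined.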
April 6, 2010

Thomas DG, Pappu RV, Baker NA (2010)
J. Biomed. Informatics, doi:10.1016/j.jbi.2010.03.001


Data generated from cancer nanotechnology research are so diverse and large in volume that it is difficult to share and efficiently use them without informatics tools. In particular, ontologies that provide a unifying knowledge framework for annotating the data are required to facilitate the semantic integration, knowledge-based searching, unambiguous interpretation, mining and inferencing of the data using informatics methods. In this paper, we discuss the design and development of the NanoParticle Ontology (NPO), which is developed within the framework of the Basic Formal Ontology (BFO) and implemented in the Web Ontology Language (OWL) using well-defined ontology design principles. The NPO was developed to represent knowledge underlying the preparation, chemical composition, and characterization of nanomaterials involved in cancer research. Public releases of the NPO are available through the BioPortal website, maintained by the National Center for Biomedical Ontology. Mechanisms for editorial and governance processes are being developed for the maintenance, review, and growth of the NPO.

This work was supported by the Siteman Center for Cancer Nanotechnology Excellence and the National Center for Biomedical Ontology.

November 1, 2009

Washington, N. L., Haendel, M., Mungall, C. J., Ashburner, M., Westerfield, M., & Lewis, S. PLoS Biol 2009 (7): e1000247.


Scientists and clinicians who study genetic alterations and disease have traditionally described phenotypes in natural language. The  considerable variation in these free-text descriptions has posed a hindrance to the important task of identifying candidate genes and models for human diseases and indicates the need for a computationally tractable method to mine data resources for mutant phenotypes. In this study, we tested the hypothesis that ontological annotation of disease phenotypes will facilitate the discovery of new genotype-phenotype relationships within and across species. To describe phenotypes using ontologies, we used an Entity-Quality (EQ) methodology, wherein the affected entity (E) and how it is affected (Q) are recorded using terms from a variety of ontologies. Using this EQ method, we annotated the phenotypes of 11 gene-linked human diseases described in Online Mendelian Inheritance in Man (OMIM). These human annotations were loaded into our Ontology-Based Database (OBD) along with other ontology-based phenotype descriptions of mutants from various model organism databases. Phenotypes recorded with this EQ method can be computationally compared based on the hierarchy of terms in the ontologies and the frequency of annotation. We utilized four similarity metrics to compare phenotypes and developed an ontology of homologous and analogous anatomical structures to compare phenotypes between species. Using these tools, we demonstrate that we can identify, through the similarity of the recorded phenotypes, other alleles of the same gene, other members of a signaling pathway, and orthologous genes and pathway members across species. We conclude that EQ-based annotation of phenotypes, in conjunction with a cross-species ontology, and a variety of similarity metrics can identify biologically meaningful similarities between genes by comparing phenotypes alone. 
This annotation and search method provides a novel and efficient means to identify gene candidates and animal models of human disease, which may shorten the lengthy path to identification and understanding of the genetic basis of human disease.

September 1, 2009

A. Ghazvinian, N. F. Noy, C. Jonquet, N. H. Shah, M. A. Musen. 8th International Semantic Web Conference (ISWC 2009), Washington, DC, Springer. In Press in 2009.

The field of biomedicine has embraced the Semantic Web probably more than any other field. As a result, there is a large number of biomedical ontologies covering overlapping areas of the field. We have developed BioPortal—an open community-based repository of biomedical ontologies. We analyzed ontologies and terminologies in BioPortal and the Unified Medical Language System (UMLS), creating more than 4 million mappings between concepts in these ontologies and terminologies based on the lexical similarity of concept names and synonyms. We then analyzed the mappings and what they tell us about the ontologies themselves, the structure of the ontology repository, and the ways in which the mappings can help in the process of ontology design and evaluation. For example, we can use the mappings to guide users who are new to a field to the most pertinent ontologies in that field, to identify areas of the domain that are not covered sufficiently by the ontologies in the repository, and to identify which ontologies will serve well as background knowledge in domain-specific tools. While we used a specific (but large) ontology repository for the study, we believe that the lessons we learned about the value of a large-scale set of mappings to ontology users and developers are general and apply in many other domains.

August 31, 2009

A. Ghazvinian, N. F. Noy, M. A. Musen. 2009 AMIA Annual Symposium, San Francisco, CA. In Press in 2009.

Creating mappings between concepts in different ontologies is a critical step in facilitating data integration. In recent years, researchers have developed many elaborate algorithms that use graph structure, background knowledge, machine learning and other techniques to generate mappings between ontologies. We compared the performance of these advanced algorithms on creating mappings for biomedical ontologies with the performance of a simple mapping algorithm that relies on lexical matching. Our evaluation has shown that (1) most of the advanced algorithms are either not publicly available or do not scale to the size of biomedical ontologies today, and (2) for many biomedical ontologies, simple lexical matching methods outperform most of the advanced algorithms in both precision and recall. Our results have practical implications for biomedical researchers who need to create alignments for their ontologies.
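
A simple mapping algorithm that relies on lexical matching can be sketched in a few lines. The normalization rule, the function names, and the toy concepts below are assumptions for illustration; the abstract does not specify how the authors' matcher normalizes names and synonyms.

```python
from collections import defaultdict


def normalize(label):
    """Lower-case and collapse non-alphanumeric characters to single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in label.lower())
    return " ".join(cleaned.split())


def lexical_mappings(source, target):
    """source/target: {concept_id: [preferred name, synonym, ...]}.
    Returns sorted (source_id, target_id) pairs sharing a normalized label."""
    index = defaultdict(set)
    for target_id, labels in target.items():
        for label in labels:
            index[normalize(label)].add(target_id)
    pairs = set()
    for source_id, labels in source.items():
        for label in labels:
            for target_id in index.get(normalize(label), ()):
                pairs.add((source_id, target_id))
    return sorted(pairs)


# Toy ontologies (hypothetical identifiers and labels).
ontology_a = {
    "A:1": ["Myocardial infarction", "Heart attack"],
    "A:2": ["Kidney"],
}
ontology_b = {
    "B:9": ["heart-attack"],
    "B:7": ["Renal pelvis"],
}
mappings = lexical_mappings(ontology_a, ontology_b)
```

Because the target labels are indexed once, the cost grows with the total number of labels rather than with the product of the two ontology sizes, which is one reason an approach of this kind can scale to large terminologies.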

August 13, 2009

C. I. Nyulas, N. F. Noy, M. V. Dorf, N. B. Griffith, M. A. Musen. Under Review in 2009.

Using ontologies to represent and drive knowledge infrastructure of software projects is the Semantic Web answer to the paradigm of Model-Driven Architecture. Advocates of this approach argue that using ontologies in this capacity provides separation of the declarative and procedural knowledge and enables easier evolution of the declarative knowledge. We have validated these conjectures in the context of BioPortal, a repository of biomedical ontologies, which was developed in our group. We are using the BioPortal Metadata Ontology to represent details about all the ontologies in the repository, including internal system information and the information that we collect from the community such as mappings between classes in different ontologies, ontology reviews, and so on. To the best of our knowledge, BioPortal is the first large-scale application that uses ontologies to represent essentially all of its internal infrastructure.
The BioPortal Metadata Ontology extends several other ontologies for representing metadata, such as the Ontology Metadata Vocabulary and the Protégé Changes and Annotations Ontology. In this paper, we show that it is feasible to describe the structure of the data that drives an application using ontologies rather than database schemas, which are traditionally used to store the infrastructure data. We also show that such an approach provides critical advantages in terms of flexibility and adaptability of the tool itself. We demonstrate the extensibility of the approach by enabling representation of views on ontologies and their corresponding metadata in the same framework.

July 23, 2009

C. Jonquet, N. H. Shah, M. A. Musen. Bio-Ontologies: Knowledge in Biology, SIG, ISMB ECCB 2009, Stockholm, Sweden. Published in 2009.

As the use of ontologies for annotation of biomedical datasets rises, a common question researchers face is that of identifying which ontologies are relevant to annotate their datasets. The number and variety of biomedical ontologies is now quite large and it is cumbersome for a scientist to figure out which ontology to (re)use in their annotation tasks. In this paper we describe an early version of an ontology recommender service, which informs the user of the most appropriate ontologies relevant for their given dataset. We provide results to illustrate that situation. The recommender service uses a semantic annotation based approach and scores the ontologies according to those annotations. The prototype service can recommend ontologies from UMLS and the NCBO BioPortal and is accessible from

May 29, 2009

Natalya F. Noy, Nigam H. Shah, Patricia L. Whetzel, Benjamin Dai, Michael Dorf, Nicholas B. Griffith, Clement Jonquet, Daniel L. Rubin, Barry Smith, Margaret-Anne Storey, Christopher G. Chute, Mark A. Musen. Nucleic Acids Research, 2009(37)W170-3.

Biomedical ontologies provide essential domain knowledge to drive data integration, information retrieval, data annotation, natural-language processing and decision support. BioPortal is an open repository of biomedical ontologies that provides access via Web services and Web browsers to ontologies developed in OWL, RDF, OBO format and Protégé frames. BioPortal functionality includes the ability to browse, search and visualize ontologies. The Web interface also facilitates community-based participation in the evaluation and evolution of ontology content by providing features to add notes to ontology terms, mappings between terms and ontology reviews based on criteria such as usability, domain coverage, quality of content, and documentation and support. BioPortal also enables integrated search of biomedical data resources such as the Gene Expression Omnibus (GEO),, and ArrayExpress, through the annotation and indexing of these resources with ontologies in BioPortal. Thus, BioPortal not only provides investigators, clinicians, and developers ‘one-stop shopping’ to programmatically access biomedical ontologies, but also provides support to integrate data from a variety of biomedical resources.

March 15, 2009

C. Jonquet, N. Shah, M. Musen. AMIA Summit on Translational Bioinformatics, March 2009, San Francisco, CA, USA, 56-60.

The range of publicly available biomedical data is enormous and is expanding fast. This expansion means that researchers now face a hurdle in extracting the data they need from the large amount of data that is available. Biomedical researchers have turned to ontologies and terminologies to structure and annotate their data with ontology concepts for better search and retrieval. However, this annotation process cannot be easily automated and often requires expert curators. In addition, there is a lack of easy-to-use systems that facilitate the use of ontologies for annotation. This paper presents the Open Biomedical Annotator (OBA), an ontology-based Web service that annotates public datasets with biomedical ontology concepts based on their textual metadata. The biomedical community can use the annotator service to tag datasets automatically with ontology terms (from UMLS and NCBO BioPortal ontologies). Such annotations facilitate translational discoveries by integrating annotated data.


February 5, 2009

Nigam H. Shah, Clement Jonquet, Annie P. Chiang, Atul J. Butte, Rong Chen, Mark A. Musen. BMC Bioinformatics, Vol. 10, February 2009.

In this work we generalize our methods to map text annotations of gene expression datasets to concepts in the UMLS. We demonstrate the utility of our methods by processing annotations of datasets in the Gene Expression Omnibus. We demonstrate that we enable ontology-based querying and integration of tissue and gene expression microarray data. We enable identification of datasets on specific diseases across both repositories. Our approach provides the basis for ontology-driven data integration for translational research on gene and protein expression data.

February 1, 2009

B. Smith & M. Brochhausen, in: B. Blobel P. Pharow and M. Nerlich (eds.), eHealth: Combining Health Telematics, Telemedicine, Biomedical Engineering and Bioinformatics on the Edge (Global Expert Summit Textbook, Studies in Health, Technology and Informatics, 134), IOS Press, Amsterdam, 219-234.

Ontologies are being ever more commonly used in biomedical informatics and we provide a survey of some of these uses, and of the relations between ontologies and other terminology resources. In order for ontologies to become truly useful, two objectives must be met. First, ways must be found for the transparent evaluation of ontologies. Second, existing ontologies need to be harmonised. We argue that one key foundation for both ontology evaluation and harmonisation is the adoption of a realist paradigm in ontology development. For science-based ontologies of the sort which concern us in the eHealth arena, it is reality that provides the common benchmark against which ontologies can be evaluated and aligned within larger frameworks.

December 18, 2008

Robert Arp and Barry Smith. National Center for Biomedical Ontology, University at Buffalo, Buffalo, NY 14260, USA. A report on the workshop "Ontologies of Cellular Networks," Newark, New Jersey, 27 to 28 March 2008.

As part of a series of workshops on different aspects of biomedical ontology sponsored by the NCBO, a workshop titled "Ontologies of Cellular Networks" took place in Newark, New Jersey, on 27 to 28 March 2008. This workshop included more than 30 participants from various backgrounds in biomedicine and bioinformatics. The goal of the workshop was to provide an introduction to the basic tools and methods of ontology, as well as to enhance coordination between groups already working on ontologies of cellular networks. The meeting focused on three questions: What is an ontology? What is a pathway? What is a cellular network?

November 15, 2008

Brochhausen M, Weiler G, Martin L, Cocos C, Stenzhorn H, Graf N, Dörr M, Tsiknakis M, Smith B. In: R. Meersman, Z. Tari, P. Herrero (eds.): OTM 2008 Workshops, LNCS 5333, 2008, 1046-1055.

In this paper we present applications of the ACGT Master Ontology (MO) which is a new terminology resource for a transnational network providing data exchange in oncology, emphasizing the integration of both clinical and molecular data. The development of a new ontology was necessary due to problems with existing biomedical ontologies in oncology. The ACGT MO is a test case for the application of best practices in ontology development. This paper provides an overview of the application of the ontology within the ACGT project thus far.

October 26, 2008

In this paper, we focus on the ontology-mapping metadata and on a community-based method to collect ontology mappings. More specifically, we develop a model for representing mappings collected from the user community and the metadata associated with the mappings. We use the model to bring together more than 30,000 mappings from 7 sources. We also validate the model by extending BioPortal (a repository of biomedical ontologies that we have developed) to enable users to create single concept-to-concept mappings in its graphical user interface, to upload and download mappings created with other tools, to comment on the mappings and to discuss them, and to visualize the mappings and the corresponding metadata.

September 1, 2008

The NCBO is developing a system for automated, ontology-based access to online biomedical resources. The system’s indexing workflow processes the text metadata of diverse resources such as datasets from GEO and ArrayExpress to annotate and index them with concepts from appropriate ontologies... In this paper, we present a comprehensive comparison of two concept recognizers – NIH’s MetaMap and the University of Michigan’s MGREP... Our evaluations demonstrate that MGREP has a clear edge over MetaMap for large-scale applications. Based on our analysis we also suggest areas of potential improvements for MGREP.

August 1, 2008

Nigam Shah. In the Encyclopedia of Database Systems, (Springer Verlag)
Area Editor: Vipul Kashyap

The largest source of biomedical knowledge is the published literature, where results of experimental studies are reported in natural language. Published literature is hard to query, to integrate computationally, or to reason over. The task of reading published papers (or other forms of experimental results such as pharmacogenomics datasets) and distilling them down into structured knowledge that can be stored in databases as well as knowledge bases is called curation. The statements comprising the structured knowledge are called annotations. The level of structure in annotation statements can vary from loose declarations of “associations” between concepts (such as associating a paper with the concept ‘colon cancer’) to statements that declare a precisely defined relationship between concepts with explicit semantics. There is an inherent tradeoff between the level of detail of the structured annotations and the time and effort required to create them. Curation to create highly structured and computable annotations requires PhD-level individuals to curate the literature.

July 24, 2008

Arp, Robert and Smith, Barry. Available from Nature Precedings.

Numerous research groups are now utilizing Basic Formal Ontology (BFO) as an upper-level framework to assist in the organization and integration of biomedical information. This paper provides elucidation of the three BFO categories of function, role, and disposition, and considers two proposed sub-categories of artifactual function and biological function. The motivation is to help advance the coherent treatment of functions, roles, and dispositions, to help provide the potential for more detailed classification, and to shed light on BFO’s general structure and use.

July 24, 2008

David P Hill, Barry Smith, Monica S McAndrews-Hill and Judith A Blake. BMC Bioinformatics 2008, 9(Suppl 5):S2 doi:10.1186/1471-2105-9-S5-S2.

To address the challenges of information integration and retrieval, the computational genomics community increasingly has come to rely on the methodology of creating annotations of scientific literature using terms from controlled structured vocabularies such as the Gene Ontology (GO). Here we address the question of what such annotations signify and of how they are created by working biologists. Our goal is to promote a better understanding of how the results of experiments are captured in annotations, in the hope that this will lead both to better representations of biological reality through annotation and ontology development and to more informed use of GO resources by experimental scientists.

July 24, 2008

Cecilia N Arighi, Hongfang Liu, Darren A Natale, Winona C Barker, Harold Drabkin, Judith A Blake, Barry Smith and Cathy H Wu. BMC Bioinformatics 2009, 10(Suppl 5):S3 doi:10.1186/1471-2105-10-S5-S3.

The Protein Ontology (PRO) is designed as a formal and principled Open Biomedical Ontologies (OBO) Foundry ontology for proteins. The components of PRO extend from a classification of proteins on the basis of evolutionary relationships at the homeomorphic level to the representation of the multiple protein forms of a gene, including those resulting from alternative splicing, cleavage and/or post-translational modifications. Focusing specifically on the TGF-beta signaling proteins, we describe the building, curation, usage and dissemination of PRO.

July 16, 2008

Smith, B. Nature Precedings, July 2008

Ontologies such as the Gene Ontology are in different respects comparable to scientific theories, to scientific databases, and to scientific journal publications. Such a view implies a new conception of what is involved in the authoring, maintenance and application of ontologies in scientific contexts, and therewith also a new approach to the evaluation of ontologies and to the training of ontologists.

June 27, 2008

C. Jonquet, M. A. Musen, N. H. Shah. International Workshop on Data Integration in The Life Sciences 2008, DILS'08, Evry, France, Springer-Verlag, 5109, Lecture Notes in BioInformatics, 144-152.

We present a system for ontology-based annotation and indexing of biomedical data; the key functionality of this system is to provide a service that enables users to locate biomedical data resources related to particular ontology concepts. The system’s indexing workflow processes the text metadata of diverse resource elements such as gene expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed article abstracts to annotate and index them with concepts from appropriate ontologies. The system enables researchers to search biomedical data sources using ontology concepts. What distinguishes this work from other biomedical search tools is: (i) the use of ontology semantics to expand the initial set of annotations automatically generated by a concept recognition tool; (ii) the unique ability to use almost all publicly available biomedical ontologies in the indexing workflow; (iii) the ability to provide the user with integrated results from different biomedical resources in one place. We discuss the system architecture as well as our experiences during its prototype implementation.

June 27, 2008

C. Jonquet, M. A. Musen, N. H. Shah. Technical Report, 2008.

We present an ontology-based annotator web service methodology that can annotate a piece of text with ontology concepts and return annotations in OWL. Currently, the annotation workflow is based on syntactic concept recognition (using concept names and synonyms) and on a set of semantic expansion algorithms that leverage the semantics in ontologies. The paper also describes an implementation of this service for life sciences and biomedicine. Our biomedical annotator service uses one of the largest available sets of publicly available terminologies and ontologies. We used it to create an index of open biomedical resources.
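
The two-step workflow described in this abstract (syntactic recognition by names and synonyms, followed by semantic expansion) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy ontology, the concept ids, and the function names are all hypothetical, and the only expansion shown is a walk up is_a parents, which is one plausible reading of "semantic expansion".

```python
import re

# Hypothetical toy ontology: concept id -> names/synonyms and is_a parents.
ONTOLOGY = {
    "C1": {"names": ["neoplasm", "tumor"], "parents": []},
    "C2": {"names": ["melanoma"], "parents": ["C1"]},
}


def recognize(text):
    """Syntactic step: ids of concepts whose name or synonym occurs in the text."""
    lowered = text.lower()
    found = set()
    for concept_id, concept in ONTOLOGY.items():
        for name in concept["names"]:
            # Whole-word, case-insensitive match on each name or synonym.
            if re.search(r"\b" + re.escape(name) + r"\b", lowered):
                found.add(concept_id)
    return found


def expand(direct):
    """Semantic step: add every is_a ancestor of the directly recognized concepts."""
    expanded = set(direct)
    stack = list(direct)
    while stack:
        for parent in ONTOLOGY[stack.pop()]["parents"]:
            if parent not in expanded:
                expanded.add(parent)
                stack.append(parent)
    return expanded


def annotate(text):
    """Return direct annotations and the extra ones gained through expansion."""
    direct = recognize(text)
    return {"direct": direct, "expanded": expand(direct) - direct}
```

Under these assumptions, a text that mentions only "melanoma" is annotated directly with C2 and, through expansion, also with its parent C1, so a search for the broader concept would still find it.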

June 1, 2008

Smith, B. & Brochhausen, M. Stud Health Technol Inform. In Press. PMID: 18376049

Ontologies are being ever more commonly used in biomedical informatics. The paper provides a survey of some of these uses, and of the relations between ontologies and other terminology resources. In order for ontologies to become truly useful, two objectives must be met. First, ways must be found for the transparent evaluation of ontologies. Second, existing ontologies need to be harmonized. The authors argue that one key foundation for both ontology evaluation and harmonization is the adoption of a realist paradigm in ontology development. For science-based ontologies of the sort which concern us in the eHealth arena, it is reality that provides the common benchmark against which ontologies can be evaluated and aligned within larger frameworks. Given the current multitude of ontologies in the biomedical domain the need for harmonization is becoming ever more urgent. An example of such harmonization within the ACGT project is described, which draws on ontology-based computing as a basis for sharing clinical and laboratory data on cancer research.

June 1, 2008

Bittner, T., Donnelly, M. & Smith, B. Journal for Geographical Information Science. 2009: 23 (6), 765-798.

This paper presents an axiomatic formalization of a theory of top-level relations between three categories of entities: individuals, universals, and collections. We deal with a variety of relations between entities in these categories, including the sub-universal relation among universals and the parthood relation among individuals, as well as cross-categorial relations such as instantiation and membership. We show that an adequate understanding of the formal properties of such relations, in particular their behavior with respect to time, is critical for geographic information processing. The axiomatic theory is developed using Isabelle, a computational system for implementing logical formalisms. All proofs are computer verified and the computational representation of the theory is available online.

May 1, 2008

Dawn Field, George Garrity, Tanya Gray, Norman Morrison, Jeremy Selengut, Peter Sterk, Tatiana Tatusova, Nicholas Thomson, Michael J Allen, Samuel V Angiuoli, Michael Ashburner, Nelson Axelrod, Sandra Baldauf, Stuart Ballard, Jeffrey Boore, Guy Cochrane, James Cole, Peter Dawyndt, Paul De Vos, Claude dePamphilis, Robert Edwards, Nadeem Faruque, Robert Feldman, Jack Gilbert, Paul Gilna, Frank Oliver Glöckner, Philip Goldstein, Robert Guralnick, Dan Haft, David Hancock, Henning Hermjakob, Christiane Hertz-Fowler, Phil Hugenholtz, Ian Joint, Leonid Kagan, Matthew Kane, Jessie Kennedy, George Kowalchuk, Renzo Kottmann, Eugene Kolker, Saul Kravitz, Nikos Kyrpides, Jim Leebens-Mack, Suzanna E Lewis, Kelvin Li, Allyson L Lister, Phillip Lord, Natalia Maltsev, Victor Markowitz, Jennifer Martiny, Barbara Methe, Ilene Mizrachi, Richard Moxon, Karen Nelson, Julian Parkhill, Lita Proctor, Owen White, Susanna-Assunta Sansone, Andrew Spiers, Robert Stevens, Paul Swift, Chris Taylor, Yoshio Tateno, Adrian Tett, Sarah Turner, David Ussery, Bob Vaughan, Naomi Ward, Trish Whetzel, Ingio San Gil, Gareth Wilson & Anil Wipat.

With the quantity of genomic data increasing at an exponential rate, it is imperative that these data be captured electronically, in a standard format. Standardization activities must proceed within the auspices of open-access and international working bodies. To tackle the issues surrounding the development of better descriptions of genomic investigations, we have formed the Genomic Standards Consortium (GSC). Here, we introduce the minimum information about a genome sequence (MIGS) specification with the intent of promoting participation in its development and discussing the resources that will be required to develop improved mechanisms of metadata capture and exchange. As part of its wider goals, the GSC also supports improving the 'transparency' of the information contained in existing genomic databases.

April 29, 2008

David P. Hill, Barry Smith, Monica S. McAndrews-Hill, Judith A. Blake. BMC Bioinformatics. 2008; 9 (Suppl 5): S2.

March 28, 2008

D. L. Rubin, D. de Abreu Moreira, P. P. Kanjamala, M. A. Musen. AAAI Spring Symposium Series, Symbiotic Relationships between Semantic Web and Knowledge Engineering, Stanford University, 2008.

We have created BioPortal, a Web portal to a virtual library of ontologies on the Semantic Web and a tool set enabling the community to access, critique, and improve ontologies. The BioPortal library contains over 50 ontologies from the biological and medical domains. In addition to a Web interface enabling researchers in cyberspace to locate these knowledge resources, BioPortal provides a suite of Web services, including ontology categorization, term search, graphical ontology visualization, and ontology version histories. ... we are also creating novel tools in BioPortal to enable the community to create mappings between classes in related ontologies and to critique ontology content, providing feedback to ontology developers. Preliminary user experience with BioPortal has been extremely positive. BioPortal appears promising for unifying and disseminating ontology content on the Semantic Web, and it is providing tools needed by the research community to exploit these rich resources.

March 28, 2008

D. L. Rubin, P. Mongkolwat, V. Kleper, K. S. Supekar, D. S. Channin. AAAI Spring Symposium Series, Semantic Scientific Knowledge Integration, Stanford. March, 2008.

While Semantic Web technologies are showing promise in tackling the information challenges in biomedicine, less attention is focused on leveraging similar technologies in imaging. We are developing methods and tools to enable the transparent discovery and use of large distributed collections of medical images in cyberspace as well as within hospital information systems. Our approach is to make the human and machine descriptions of image pixel content machine-accessible through annotation using ontologies. We created an ontology of image annotation and markup, specifying the entities and relations necessary to represent the semantics of medical image pixel content. We are creating a toolkit to collect the annotations directly from researchers and physicians as they view the images on medical imaging workstations.

March 24, 2008

Ashburner, M., Leser, U. & Rebholz-Schuhmann, D. Dagstuhl Seminar Proc. March, 2008. URN: urn:nbn:de:0030-drops-15234

Researchers in Text Mining and researchers active in developing ontological resources provide solutions to preserve semantic information properly, i.e. in ontologies and/or fact databases. Researchers from both fields tend to work independently from each other, but there is a shared interest to profit from ongoing research in the complementary domain. The relatedness of both domains has led to the idea to organize a workshop that brings together members of both research domains.

February 1, 2008

Kotecha N, Bruck K, Lu W and Shah N.H. Accepted in Applied Ontology, Feb 2008.

The role of proteins and their function in pathways is crucial to understanding complex biological processes and their failures that lead to disease. With over 200 pathway databases in existence, it is not possible for biologists to examine a pathway in all of them. The emergence and adoption of Biological Pathways Exchange (BioPAX), a standardized format for exchanging pathway information, provides a unique opportunity to integrate knowledge from multiple pathway databases. We conducted a case study integrating multiple pathway databases using BioPAX and Oracle’s resource description framework (RDF) data repository. This integration enables querying across different species and across multiple pathway resources simultaneously. It also enables comparison of the degree of complementarity across different pathway sources.

January 1, 2008

Katherine Munn, Barry Smith (Eds.), Frankfurt/Lancaster: ontos. ISBN: 978-3-938793-98-5

January 1, 2008

Okada, M. & Smith, B. (eds.) Proc of the First Interdisciplinary Ontology Meeting (Tokyo, Japan), Tokyo: Keio University Press. 2008.


January 1, 2008

In A. Burger, D. Davidson & R. Baldock (eds.), Anatomy Ontologies for Bioinformatics: Principles and Practice, New York: Springer, 289-305. 2008.

It is now increasingly accepted that many existing biological and medical ontologies can be improved by adopting tools and methods that bring a greater degree of logical and ontological rigor. In this chapter we will focus on the merits of a logically sound approach to ontologies from a methodological point of view. As we shall see, one crucial feature of a logically sound approach is that we have clear and functional definitions of the relational expressions such as ‘is_a’ and ‘part_of’. While this chapter is mainly concerned with the general issues of methodology, chapter 15, on ‘Spatial Representation and Reasoning’, will apply the methodology to the specific case of spatial relations. Although both chapters are self-contained, we recommend that they be seen as forming a unity.

January 1, 2008

Hernandez, M.-E., Falconer, S., Storey, M.-A., Carini S, & Sim I. CASCON, Toronto, Canada.

Searching and comparing information from semi-structured repositories is an important, but cognitively complex activity for internet users. The typical web interface displays a list of results as a textual list which is limited in helping the user compare or gain an overview of the results from a series of iterative queries. In this paper, we propose a new interactive, lightweight technique that uses multiple synchronized tag clouds to support iterative visual analysis and filtering of query results. Although tag clouds are frequently available in web interfaces, they are typically used for providing an overview of key terms in a set of results, but thus far have not been used for presenting semi-structured information to support iterative queries. We evaluated our proposed design in a user study that presents typical search and comparison scenarios to users trying to understand heterogeneous clinical trials from a leading repository of scientific information. The study gave us valuable insights regarding the challenges that semi-structured data collections pose, and indicated that our design may ease cognitively demanding browsing activities of semi-structured information.

January 1, 2008

D. L. Rubin, N. H. Shah, N. F. Noy. Briefings in Bioinformatics, January 2008. 9(1):75-90.

The objective of this review is to give an overview of biomedical ontology in practical terms by providing a functional perspective, describing how bio-ontologies can be and are being used. As biomedical scientists begin to recognize the many different ways ontologies enable biomedical research, they will drive the emergence of new computer applications that will help them exploit the wealth of research data now at their fingertips.

January 1, 2008

Marinelli RJ, Montgomery K, Liu CL, Shah NH, Prapong W, Nitzberg M, Zachariah ZK, Sherlock GJ, Natkunam Y, West RB, van de Rijn M, Brown PO, Ball CA. Nucleic Acids Res., 36(Database issue): D871–D877. January 2008.

The Stanford Tissue Microarray Database (TMAD) is a public resource for disseminating annotated tissue images and associated expression data. Stanford University pathologists, researchers and their collaborators worldwide use TMAD for designing, viewing, scoring and analyzing their tissue microarrays. The use of tissue microarrays allows hundreds of human tissue cores to be simultaneously probed by antibodies to detect protein abundance (Immunohistochemistry; IHC), or by labeled nucleic acids (in situ hybridization; ISH) to detect transcript abundance. TMAD archives multi-wavelength fluorescence and bright-field images of tissue microarrays for scoring and analysis.

December 8, 2007

Noy NF, Rubin DL. Web Semantics. 2008; 6 (2):133-136.

The Foundational Model of Anatomy (FMA) represents the result of manual and disciplined modeling of the structural organization of the human body. It is a tremendous resource in bioinformatics that facilitates sharing of information among applications that use anatomy knowledge. The FMA was developed in Protégé and the Protégé frames language is the canonical representation language for the FMA. We present a translation of the original Protégé frame representation of the FMA into OWL. Our effort is complementary to the earlier efforts to represent FMA in OWL and is focused on two main goals: (1) representing only the information that is explicitly present in the frames representation of the FMA or that can be directly inferred from the semantics of Protégé frames; (2) representing all the information that is present in the frames representation of the FMA, thus producing an OWL representation for the complete FMA. Our complete representation of the FMA in OWL consists of two components: an OWL DL component that contains the FMA constructs that are compatible with OWL DL; and an OWL Full component that imports the OWL DL component and adds the FMA constructs that OWL DL does not allow.

December 1, 2007

Pathak, J., Johnson, T. M. & Chute, C. G. IEEE International Conference on Information Reuse and Integration, Las Vegas, NV. 2008.

In the past several years, various ontologies and terminologies such as the Gene Ontology have been developed to enable interoperability across multiple diverse medical information systems. They provide a standard way of representing terms and concepts thereby supporting easy transmission and interpretation of data for various applications. However, with their growing utilization, not only has the number of available ontologies increased considerably, but they are also becoming larger and more complex to manage. Toward this end, a growing body of work is emerging in the area of modular ontologies where the emphasis is on either extracting and managing "modules" of an ontology relevant to a particular application scenario (ontology decomposition) or developing them independently and integrating into a larger ontology (ontology composition). In this paper, we investigate state-of-the-art approaches in modular ontologies focusing on techniques that are based on rigorous logical formalisms as well as well-studied graph theories. We analyze and compare how such approaches can be leveraged in developing tools and applications in the biomedical domain. We conclude by highlighting some of the limitations of the modular ontology formalisms and put forward additional requirements to steer their future development.

November 27, 2007

Natale, D. A., Arighi, C. N., Barker, W., et al. BMC Bioinformatics, 8 (Suppl 9): S1. 2007. PMID: 18047702.

A number of ontologies describe properties that can be attributed to proteins. For example, protein functions are described by the Gene Ontology (GO) and human diseases by SNOMED CT or ICD10. There is, however, a gap in the current set of ontologies: what is missing is one that describes the protein entities themselves and their relationships. We have designed the PRotein Ontology (PRO) to facilitate protein annotation and to guide new experiments. The components of PRO extend from the classification of proteins on the basis of evolutionary relationships to the representation of the multiple protein forms of a gene (products generated by genetic variation, alternative splicing, proteolytic cleavage, and other post-translational modifications). PRO will allow the specification of relationships between PRO, GO and other ontologies in the OBO Foundry.

November 27, 2007

Taylor, C. F., Field, D., Sansone, S.-A., Apweiler, R., Ashburner, M., Ball, C. A., Binz, P.-A., Brazma, A., Brinkman, R., Deutsch, E. W., Fiehn, O., Fostel, J., Ghazal, P., Grimes, G., Hardy, N. W., Hermjakob, H., Julian, R. K. Jr., Kane, M., Kolker, E., Kuiper, M., Le Novère, N., Leebens-Mack, J., Lewis, S. E., McNally, R., Mehrle, A., Morrison, N., Quackenbush, J., Robertson, D. G., Rocca-Serra, P., Smith, B., Snape, J., Sterk, P. & Wiemann, S. Nature Biotechnology, 26, 889-896. Aug 2008. doi: 10.1038/nbt.141

Throughout the biological and biomedical sciences, prescriptive ‘minimum information’ (MI) checklists specifying the key information to include when reporting experimental results are beginning to find favor with experimentalists, analysts, publishers and funders alike. Such checklists aim to ensure that methods, data, analyses and results are described to a level sufficient to support the unambiguous interpretation, sophisticated search, reanalysis, experimental corroboration and reuse of data sets, facilitating the extraction of maximum value from them. However, such MI checklists are usually developed independently by groups working within particular biologically- or technologically-delineated domains.

November 27, 2007

Sam, L., Lussier, Y., Borlawsky, T., Li, J., Tao Y. & Smith, B. Proc of the Annu Symp of the American Medical Informatics Association, Chicago, IL. 2007. Poster.

The emphasis on evidence-based medicine (EBM) has placed increased focus on finding timely answers to clinical questions in the presence of patients. Using a combination of natural language processing for the generation of clinical excerpts and information-theoretic distance-based clustering, we evaluated multiple approaches for the efficient presentation of context-sensitive EBM excerpts.

November 11, 2007

S. M. Falconer, N. F. Noy, M. A. Storey. Conference Proceeding from the Second International Workshop on Ontology Matching at ISWC 07 + ASWC 07, Busan, Korea. November 2007.

Ontology mapping is the key to data interoperability in the semantic web vision. Computing mappings is the first step to applications such as query rewriting, instance sharing, web-service integration, and ontology merging. This problem has received a lot of attention in recent years, but little is known about how users actually construct mappings. Several ontology-mapping tools have been developed, but which tools do users actually use? What processes are users following to discover, track, and compute mappings? How do teams coordinate when performing mappings? In this paper, we discuss the results from an on-line user survey where we gathered feedback from the community to help answer these important questions. We discuss the results from the survey and the implications they may have on the mapping research community.

November 11, 2007

Dilvan A. Moreira, PhD, Nigam H. Shah, MD, PhD, and Mark A. Musen, MD, PhD

The Gene Ontology (GO) is the most widely used ontology for creating biomedical annotations. GO annotations are statements associating a biological entity with a GO term. These statements comprise a large dataset of biological knowledge that is used widely in biomedical research. GO annotations are available as “gene association files” from the GO website in a tab-delimited file format (GO Annotation File Format) composed of rows of 15 tab-delimited fields. This simple format lacks the knowledge representation (KR) capabilities to represent the semantic relationships between fields unambiguously. This paper demonstrates that this KR shortcoming leads users to interpret the files in ways that can be erroneous. We propose a complementary format to represent GO annotation files as knowledge bases using the W3C recommended Web Ontology Language (OWL).
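
As an illustration of the flat 15-field row format this abstract refers to, the Python sketch below splits one row into named fields. The column names are an assumption based on the GAF 1.0 layout as commonly documented and should be checked against the GO documentation; the function names are hypothetical. The `is_negated` helper shows one place where meaning is implicit in the row: a `NOT` in the qualifier column reverses the sense of the annotation, and nothing in the tab-delimited structure itself signals that.

```python
# Assumed column names for the 15 tab-delimited fields (GAF 1.0 layout);
# verify against the GO documentation before relying on them.
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence_Code", "With_From", "Aspect",
    "DB_Object_Name", "DB_Object_Synonym", "DB_Object_Type",
    "Taxon", "Date", "Assigned_By",
]


def parse_gaf_line(line):
    """Parse one row into a dict; return None for comment or blank lines."""
    if line.startswith("!") or not line.strip():
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(
            "expected %d fields, got %d" % (len(GAF_COLUMNS), len(fields))
        )
    return dict(zip(GAF_COLUMNS, fields))


def is_negated(row):
    """True if the qualifier column contains NOT (pipe-separated values)."""
    return "NOT" in row["Qualifier"].split("|")
```

A reader who looks only at the entity and GO term columns would take a `NOT` row as a positive association, which is the kind of misreading an explicit knowledge representation is meant to prevent.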

November 11, 2007

S. M. Falconer, M-A. Storey. International Semantic Web Conference, Busan, Korea. November 2007.

Ontology mapping is the key to data interoperability in the semantic web. This problem has received a lot of research attention; however, the research emphasis has been mostly devoted to automating the mapping process, even though the creation of mappings often involves the user. As industry interest in semantic web technologies grows and the number of widely adopted semantic web applications increases, we must begin to support the user. In this paper, we combine data gathered from background literature, theories of cognitive support and decision making, and an observational case study to propose a theoretical framework for cognitive support in ontology mapping tools. We also describe a tool called CogZ that is based on this framework.

November 7, 2007

Grenon, P. & Smith, B. In: Kanzian C (ed.). Persistence, Frankfurt/Lancaster: ontos, 33-48. 2008.

We aim to provide the ontological grounds for an adequate account of persistence. We defend a perspectivalist, or moderate pluralist, position, according to which some aspects of reality can be accounted for in ontological terms only via partial and mutually complementary ontologies, each one of which captures some relevant aspect of reality. Our thesis here is that this is precisely the sort of ontological account that is needed for the understanding of persistence, specifically an account involving two independent ontologies, one for continuants, and one for occurrents.

November 7, 2007

Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J Mungall, The OBI Consortium, Neocles Leontis, Philippe Rocca-Serra, Alan Ruttenberg, Susanna-Assunta Sansone, Richard H Scheuermann, Nigam Shah, Patricia L Whetzel & Suzanna Lewis. Nature Biotechnology 25, 1251 - 1255 (2007)
Published online: 7 November 2007 | doi:10.1038/nbt1346

Existing OBO ontologies, including the Gene Ontology, are undergoing coordinated reform, and new ontologies are being created on the basis of an evolving set of shared principles governing ontology development. The result is an expanding family of ontologies designed to be interoperable and logically well formed and to incorporate accurate representations of biological reality. We describe this OBO Foundry initiative and provide guidelines for those who might wish to become involved.

November 7, 2007

Rudnicki, R., Ceusters, W., Manzoor, S. & Smith, B. Proc of the Annu Symp of the American Medical Informatics Association, Chicago, IL. 630-634, 2007.

Referent Tracking (RT) advocates the use of instance unique identifiers to refer to the entities comprising the subject matter of patient health records. RT promises many benefits to those who use health record data to improve patient care. To further the adoption of the paradigm we provide an illustration of how data from an EHR application needs to be decomposed in order to make it accord with the tenets of RT. We describe the ontological principles on which this decomposition is based in order to allow integration efforts to be applied in similar ways to other EHR applications. We find that an ordinary statement from an EHR contains a surprising amount of "hidden" data that are only revealed by its decomposition according to these principles.

November 7, 2007

Ceusters WM, Spackman KA, Smith B. AMIA Annu Symp Proc. 2007 Oct 11:105-9.

If SNOMED CT is to serve as a biomedical reference terminology, then steps must be taken to ensure comparability of information formulated using successive versions. New releases are therefore shipped with a history mechanism. We assessed the adequacy of this mechanism for its treatment of the distinction between changes occurring on the side of entities in reality and changes in our understanding thereof. We found that these two types are only partially distinguished and that a more detailed study is required to propose clear recommendations for enhancement along at least the following lines:

November 7, 2007

The Gene Ontology Consortium. Nucleic Acids Research, Advance Access published November 4, 2007. 36:D440-D444.

The Gene Ontology (GO) project provides a set of structured, controlled vocabularies for community use in annotating genes, gene products and sequences. The ontologies have been extended and refined for several biological areas, and improvements to the structure of the ontologies have been implemented. To improve the quantity and quality of gene product annotations available from its public repository, the GO Consortium has launched a focused effort to provide comprehensive and detailed annotation of orthologous genes across a number of ‘reference’ genomes, including human and several key model organisms. Software developments include two releases of the ontology-editing tool OBO-Edit, and improvements to the AmiGO browser interface.

October 28, 2007

H. Alani, N. F. Noy, N. H. Shah, M. A. Musen. (Conference Proceeding from the Fourth International Conference on Knowledge Capture (K-CAP 2007), Whistler, BC, Canada). ACM 2007:55-62.

As more ontologies become publicly available, finding the "right" ontologies becomes much harder. In this paper, we address the problem of ontology search: finding a collection of ontologies from an ontology repository that are relevant to the user's query. In particular, we look at the case when users search for ontologies relevant to a particular topic (e.g., an ontology about anatomy). Ontologies that are most relevant to such a query often do not have the query term in the names of their concepts (e.g., the Foundational Model of Anatomy ontology does not have the term "anatomy" in any of its concepts' names). Thus, we present a new ontology-search technique that helps users in these types of searches. When looking for ontologies on a particular topic (e.g., anatomy), we retrieve from the Web a collection of terms that represent the given domain (e.g., terms such as body, brain, skin, etc. for anatomy). We then use these terms to expand the user query. We evaluate our algorithm on queries for topics in the biomedical domain against a repository of biomedical ontologies. We use the results obtained from experts in the biomedical-ontology domain as the gold standard. Our experiments demonstrate that using our method for query expansion improves retrieval results by 113%, compared to the tools that search only for the user query terms and consider only class and property names (like Swoogle). We show a 43% improvement for the case where not only class and property names but also property values are taken into account.
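
The query-expansion idea in this abstract can be sketched in a few lines of Python. This is an illustrative toy rather than the authors' algorithm: the repository contents, the `DOMAIN_TERMS` table (standing in for terms retrieved from the Web), and the simple hit-count ranking are all assumptions made for the example.

```python
# Hypothetical repository: ontology name -> concept names.
REPOSITORY = {
    "FMA": ["Brain", "Skin", "Heart"],
    "GO": ["nucleus", "cell cycle"],
}

# Hypothetical stand-in for domain terms retrieved from the Web.
DOMAIN_TERMS = {"anatomy": ["body", "brain", "skin"]}


def search(query, repository, domain_terms=None):
    """Rank ontologies by how many (expanded) query terms match concept-name words."""
    terms = {query.lower()}
    if domain_terms:
        # Expansion step: add the terms that represent the query's domain.
        terms.update(t.lower() for t in domain_terms.get(query.lower(), []))
    scores = {}
    for name, concept_names in repository.items():
        words = {w for concept in concept_names for w in concept.lower().split()}
        hits = len(terms & words)
        if hits:
            scores[name] = hits
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda n: (-scores[n], n))
```

With this toy data, `search("anatomy", REPOSITORY)` finds nothing because no concept name contains the word "anatomy", while `search("anatomy", REPOSITORY, DOMAIN_TERMS)` retrieves FMA through the expanded terms "brain" and "skin".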

September 15, 2007

Mabee, P. M., Arratia, G., Coburn, M., Haendel, M., Hilton, E. J., Lundberg, J. G., Mayden, R. L., Rios, N. & Westerfield, M. J Exp Zool B Mol Dev Evol, 308B, 655-668. September, 2007.

One focus of developmental biology is to understand how genes regulate development, and therefore examining the phenotypic effects of gene mutation is a major emphasis in studies of zebrafish and other model organisms. Genetic change underlies alterations in evolutionary characters, or phenotype, and morphological phylogenies inferred by comparison of these characters. We will utilize both existing and new ontologies to connect the evolutionary anatomy and image database that is being developed in the Cypriniformes Tree of Life project to the Zebrafish Information Network database. Ontologies are controlled vocabularies that formally represent hierarchical relationships among defined biological concepts. If used to recode the free-form text descriptors of anatomical characters, evolutionary character data can become more easily computed, explored, and mined. A shared ontology for homologous modules of the phenotype must be referenced to connect the growing databases in each area in a way that evolutionary questions can be addressed. We present examples that demonstrate the broad utility of this approach.

September 1, 2007

B. Srinivasan, N. H. Shah, J. A. Flannick, E. Abeliuk, A. F. Novak, S. Batzoglou. Briefings in Bioinformatics, September 2007. 8(5):318-332.

The collection of multiple genome-scale datasets is now routine, and the frontier of research in systems biology has shifted accordingly. Rather than clustering a single dataset to produce a static map of functional modules, the focus today is on data integration, network alignment, interactive visualization and ontological markup. Because of the intrinsic noisiness of high-throughput measurements, statistical methods have been central to this effort. In this review, we briefly survey available datasets in functional genomics, review methods for data integration and network alignment, and describe recent work on using network models to guide experimental validation. We explain how the integration and validation steps spring from a Bayesian description of network uncertainty, and conclude by describing an important near-term milestone for systems biology: the construction of a set of rich reference networks for key model organisms.

August 28, 2007

N. H. Shah, D. L. Rubin, I. Espinosa, K. Montgomery, M. A. Musen. BMC Bioinformatics, August 2007. 8:296.

BACKGROUND: The Stanford Tissue Microarray Database (TMAD) is a repository of data serving a consortium of pathologists and biomedical researchers. The tissue samples in TMAD are annotated with multiple free-text fields, specifying the pathological diagnoses for each sample. These text annotations are not structured according to any ontology, making future integration of this resource with other biological and clinical data difficult. RESULTS: We developed methods to map these annotations to the NCI thesaurus. Using the NCI-T we can effectively represent annotations for about 86% of the samples. We demonstrate how this mapping enables ontology driven integration and querying of tissue microarray data. We have deployed the mapping and ontology driven querying tools at the TMAD site for general use. CONCLUSION: We have demonstrated that we can effectively map the diagnosis-related terms describing a sample in TMAD to the NCI-T. The NCI thesaurus terms have a wide coverage and provide terms for about 86% of the samples. In our opinion the NCI thesaurus can facilitate integration of this resource with other biological data.

July 1, 2007

Paula M. Mabee, Michael Ashburner, Quentin Cronk, Georgios V. Gkoutos, Melissa Haendel, Erik Segerdell, Chris Mungall and Monte Westerfield. Trends in Ecology and Evolution, July 2007. 22(7):345-350.

Understanding the developmental and genetic underpinnings of particular evolutionary changes has been hindered by inadequate databases of evolutionary anatomy and by the lack of a computational approach to identify underlying candidate genes and regulators. By contrast, model organism studies have been enhanced by ontologies shared among genomic databases. Here, we suggest that evolutionary and genomics databases can be developed to exchange and use information through shared phenotype and anatomy ontologies. This would facilitate computing on evolutionary questions pertaining to the genetic basis of evolutionary change, the genetic and developmental bases of correlated characters and independent evolution, biomedical parallels to evolutionary change, and the ecological and paleontological correlates of particular types of change in genes, gene networks and developmental pathways.

June 27, 2007

Woei-Jyh Lee, Louiqa Raschid, Padmini Srinivasan, Nigam Shah, Daniel Rubin, and Natasha Noy. DILS 2007, 4th International Workshop, Philadelphia, PA, June 2007. LNBI 4544:247-263.

This paper presents the LSLink (or Life Science Link) methodology that provides users with a set of tools to explore the rich Web of interconnected and annotated objects in multiple repositories, and to identify meaningful associations. Consider a physical link between objects in two repositories, where each of the objects is annotated with controlled vocabulary (CV) terms from two ontologies. Using a set of LSLink instances generated from a background dataset of knowledge we identify associations between pairs of CV terms that are potentially significant and may lead to new knowledge. We develop an approach based on the logarithm of the odds (LOD) to determine a confidence and support in the associations between pairs of CV terms. Using a case study of Entrez Gene objects annotated with GO terms linked to PubMed objects annotated with MeSH terms, we describe a user validation and analysis task to explore potentially significant associations.
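
As a rough illustration of scoring an association between a pair of CV terms, the sketch below computes a log-odds value and a support count over a set of links. The abstract does not give the paper's exact LOD formula, so the definition used here (log10 of the observed co-occurrence frequency over the frequency expected if the two terms were independent), the function name, and the toy links are all assumptions.

```python
import math


def lod_and_support(links, term_a, term_b):
    """Return (log-odds score, support) for a pair of CV terms.

    links: list of (left_terms, right_terms) pairs, one per link, where each
    element is the set of CV terms annotating the object on that side.
    Support is the number of links where both terms occur.
    """
    total = len(links)
    count_a = sum(term_a in left for left, _ in links)
    count_b = sum(term_b in right for _, right in links)
    count_ab = sum(term_a in left and term_b in right for left, right in links)
    if count_ab == 0:
        return float("-inf"), 0
    observed = count_ab / total
    expected = (count_a / total) * (count_b / total)
    return math.log10(observed / expected), count_ab


# Hypothetical links: gene-side terms (g1, g2) and literature-side terms (m1, m2).
TOY_LINKS = [
    ({"g1"}, {"m1"}),
    ({"g1"}, {"m1"}),
    ({"g2"}, {"m2"}),
    ({"g2"}, {"m2"}),
]

lod, support = lod_and_support(TOY_LINKS, "g1", "m1")
```

A positive score means the pair co-occurs more often than independence would predict; the support count indicates how many links that estimate rests on.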

June 2, 2007

Shah N, Musen M. AMIA Annu Symp Proc. 2008 Nov 6:652-6.

The Metathesaurus from the Unified Medical Language System (UMLS) is a widely used ontology resource, which is mostly used in a relational database form for terminology research, mapping and information indexing. A significant section of UMLS users use a MySQL installation of the metathesaurus and Perl programming language as their access mechanism. We describe UMLS-Query, a Perl module that provides functions for retrieving concept identifiers, mapping text-phrases to Metathesaurus concepts and graph traversal in the Metathesaurus stored in a MySQL database. UMLS-Query can be used to build applications for semi-automated sample annotation, terminology based browsers for tissue sample databases and for terminology research. We describe the results of such uses of UMLS-Query and present the module for others to use.

May 8, 2007

Ceusters, W. & Smith, B. Proc of WWW2007 Workshop i3: Identity, Identifiers, Identification (Workshop on Entity-Centric Approaches to Information and Knowledge Management on the Web), Banff, Canada, 2007.

March 1, 2007

Grasela, T. H., Fiedler-Kelly, J., Cirincione, B., Hitchcock, D., Reitz, K., Sardella, S., & Smith, B. AAPS Journal, 9(1): E84-E91. March, 2007. DOI: 10.1208/aapsj0901008. PMID: 17408238.

The current informal practice of pharmacometrics as a combination art and science makes it hard to appreciate the role that informatics can and should play in the future of the discipline and to comprehend the gaps that exist because of its absence. The development of pharmacometric informatics has important implications for expediting decision making and for improving the reliability of decisions made in model-based development. We argue that well-defined informatics for pharmacometrics can lead to much needed improvements in the efficiency, effectiveness, and reliability of the pharmacometrics process.

The purpose of this paper is to provide a description of the pervasive yet often poorly appreciated role of informatics in improving the process of data assembly, a critical task in the delivery of pharmacometric analysis results. First, we provide a brief description of the pharmacometric analysis process. Second, we describe the business processes required to create analysis-ready data sets for the pharmacometrician. Third, we describe selected informatic elements required to support the pharmacometrics and data assembly processes. Finally, we offer specific suggestions for performing a systematic analysis of existing challenges as an approach to defining the next generation of pharmacometric informatics.

January 22, 2007

Bodenreider, O., Smith, B., Kumar, A. & Burgun, A. Artificial Intelligence in Medicine, 39(3), 183-195. 2007. PMID: 17241777

The objective of this paper is to study the degree to which one DL-based biomedical terminology complies with a basic set of ontological principles. We selected SNOMED CT as target for this evaluation because it is the most comprehensive biomedical terminology recently developed in native DL formalism. Another reason for our choice is that SNOMED CT is now available as part of the UMLS at no charge for UMLS licensees in the U.S. It is therefore likely to become widely used in medical information systems.

January 12, 2007

Mungall, C. J., Gkoutos, G., Washington, N., & Lewis, S. OWLED proceedings.

Accurate representation of phenotypes using ontologies is important in biology and biomedicine. This paper describes the OWL translation of our methodology for representing phenotypes using ontologies in the OBO Foundry.

January 12, 2007

Carini, S., Hernandez, M., Storey, M.-A., Horvath, T., Kennedy, G., Rutherford, G., & Sim, I. AMIA Annu Symp Proc. 2008; 2008: 298–302.

Clinical questions are often studied by randomized clinical trials (RCTs) of heterogeneous design. Systematic reviewers and trial designers need to compare the design and results across these trials. If trial information is available in computer processable form, computer-based visualization techniques can provide cognitive support for such comparisons. CTeXplorer offers systematic reviewers and trial designers a tool to better and more quickly understand design heterogeneity in RCTs. CTeXplorer supports dynamic queries on eligibility criteria, interventions, and outcomes in three linked views. We tested CTeXplorer for displaying 12 RCTs on prevention of mother-to-child transmission of HIV. Three target users found the representation and organization of information intuitive and easy to learn. They were able to use CTeXplorer to achieve a quick cognitive overview of a heterogeneous group of RCTs. This work shows the benefit of capturing trial information in computable form. Future work includes leveraging ontologies to enhance CTeXplorer visualizations.

January 1, 2007

(second edition). Noy, N.F. In S. Staab & R. Studer (eds.), Springer.

An ontology is a formal description of concepts and relationships that can exist for a community of human and/or machine agents. The notion of ontologies is crucial for the purpose of enabling knowledge sharing and reuse. The Handbook on Ontologies provides a comprehensive overview of the current status and future prospects of the field of ontologies considering ontology languages, ontology engineering methods, example ontologies, infrastructures and technologies for ontologies, and how to bring this all into ontology-based infrastructures and applications that are among the best of their kind. The field of ontologies has tremendously developed and grown in the five years since the first edition of the Handbook on Ontologies. Therefore, its revision includes 21 completely new chapters as well as a major re-working of 15 chapters transferred to this second edition.

January 1, 2007

Noy, N. F. & Musen, M. A. In C. Parent, S. Spaccapietra, H. Stuckenschmidt (eds.), Springer.

One of the original motivations for ontology research was the belief that ontologies can help with reuse in knowledge representation. However, many of the ontologies that are developed with reuse in mind, such as standard reference ontologies and controlled terminologies, are extremely large, while the users often need to reuse only a small part of these resources in their work. Specifying various views of an ontology enables users to limit the set of concepts that they see. In this chapter, we develop the concept of a Traversal View, a view where a user specifies the central concept or concepts of interest, the relationships to traverse to find other concepts to include in the view, and the depth of the traversal. For example, given a large ontology of anatomy, a user may use a Traversal View to extract a concept of Lung and organs and organ parts that surround the lung or are contained in the lung. We define the notion of Traversal Views formally, discuss their properties, present a strategy for maintaining the view through ontology evolution and describe our tool for defining and extracting Traversal Views.
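
The Traversal View described in this abstract (starting concepts, relationships to follow, and a traversal depth) can be sketched as a depth-limited breadth-first search. This is an illustrative simplification, not the authors' tool: the triple-based edge list, the function name, the single depth shared by all relationships, and the toy anatomy fragment are assumptions.

```python
from collections import deque


def traversal_view(edges, start_concepts, relations, max_depth):
    """Collect concepts reachable from the start concepts.

    edges: list of (source, relation, target) triples.
    relations: set of relation names that may be traversed.
    max_depth: maximum number of edges to follow from any start concept.
    """
    view = set(start_concepts)
    queue = deque((concept, 0) for concept in start_concepts)
    while queue:
        concept, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for source, relation, target in edges:
            if source == concept and relation in relations and target not in view:
                view.add(target)
                queue.append((target, depth + 1))
    return view


# Hypothetical fragment of an anatomy ontology.
TOY_EDGES = [
    ("Lung", "has_part", "Bronchus"),
    ("Bronchus", "has_part", "Bronchiole"),
    ("Lung", "adjacent_to", "Heart"),
    ("Heart", "has_part", "Left ventricle"),
    ("Lung", "is_a", "Organ"),
]

shallow = traversal_view(TOY_EDGES, ["Lung"], {"has_part", "adjacent_to"}, 1)
deeper = traversal_view(TOY_EDGES, ["Lung"], {"has_part", "adjacent_to"}, 2)
```

Because "is_a" is not among the traversed relationships, "Organ" stays out of both views; raising the depth from 1 to 2 pulls in the parts of the concepts found at depth 1.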

December 31, 2006

Ceusters, W., Elkin, P. & Smith, B. International Journal of Medical Informatics, 76, 326-333. 2007. PMID: 17369081

PURPOSE: A substantial fraction of the observations made by clinicians and entered into patient records are expressed by means of negation or by using terms which contain negative qualifiers (as in "absence of pulse" or "surgical procedure not performed"). This seems at first sight to present problems for ontologies, terminologies and data repositories that adhere to a realist view and thus reject any reference to putative non-existing entities. Basic Formal Ontology (BFO) and Referent Tracking (RT) are examples of such paradigms. The purpose of the research here described was to test a proposal to capture negative findings in electronic health record systems based on BFO and RT.

November 11, 2006

N. H. Shah, D. L. Rubin, K. S. Supekar, M. A. Musen. AMIA Annual Symposium, Washington DC, November 2006. 709-713.

The Stanford Tissue Microarray Database (TMAD) is a repository of data amassed by a consortium of pathologists and biomedical researchers. The TMAD data are annotated with multiple free-text fields, specifying the pathological diagnoses for each tissue sample. These annotations are spread out over multiple text fields and are not structured according to any ontology, making it difficult to integrate this resource with other biological and clinical data. We developed methods to map these annotations to the NCI thesaurus and the SNOMED-CT ontologies. Using these two ontologies we can effectively represent about 80% of the annotations in a structured manner. This mapping offers the ability to perform ontology driven querying of the TMAD data. We also found that 40% of annotations can be mapped to terms from both ontologies, providing the potential to align the two ontologies based on experimental data. Our approach provides the basis for a data-driven ontology alignment by mapping annotations of experimental data.

July 23, 2006

d'Entremont, T., and M.-A. Storey. Proceedings of the 9th International Protégé Conference, Stanford University, July 23-26, 2006.

Visualizations are commonly used as a cognitive aid for presenting large ontologies and instance data. One challenge with these visual techniques is that the generated views are often very dense and complex. It is difficult to know which concepts to include in the visualization to meet a user's information needs. In this paper we present recent work that proposes using an attention-reactive interface to provide adaptive visualizations in Protégé. This furthers our recent work in providing visualization "on demand" for maintenance, editing, and understanding tasks by drawing users' attention to concepts of interest within the context of the current task.

June 1, 2006

D. L. Rubin, N. F. Noy, J. D. Richter, B. Smith, M. A. Storey, H. Solbrig, C. G. Chute, I. Sim, M. Ashburner, M. Westerfield, S. Misra, C. J. Mungall, S. E. Lewis, M. A. Musen. OMICS: A Journal of Integrative Biology, June 2006. 10(2):185-198.

The National Center for Biomedical Ontology is a consortium that comprises leading informaticians, biologists, clinicians, and ontologists, funded by the National Institutes of Health (NIH) Roadmap, to develop innovative technology and methods that allow scientists to record, manage, and disseminate biomedical information and knowledge in machine-processable form. The goals of the Center are (1) to help unify the divergent and isolated efforts in ontology development by promoting high quality open-source, standards-based tools to create, manage, and use ontologies, (2) to create new software tools so that scientists can use ontologies to annotate and analyze biomedical data, (3) to provide a national resource for the ongoing evaluation, integration, and evolution of biomedical ontologies and associated tools and theories in the context of driving biomedical projects (DBPs), and (4) to disseminate the tools and resources of the Center and to identify, evaluate, and communicate best practices of ontology development to the biomedical community. Through the research activities within the Center, collaborations with the DBPs, and interactions with the biomedical community, our goal is to help scientists to work more effectively in the e-science paradigm, enhancing experiment design, experiment execution, data analysis, information synthesis, hypothesis generation and testing, and the understanding of human disease.

This paper is part of the special issue of OMICS on data standards.