Bringing reason to phenotype diversity, character change, and common descent




For more than a century, systematic biologists have meticulously documented the stunning diversity of phenotypes across the tree of life in the comparative systematics literature. This vast store of often complex character and character state descriptions informs our understanding of the evolutionary transitions that gave rise to the present diversity of life on Earth. Yet, as free text in natural language, these descriptive data are not amenable to even simple computational processing, such as comparing organisms by phenotype similarity, much less to large-scale data integration and knowledge mining. I will present the approach that we have adopted within the Phenoscape project to expose these data to machine reasoning. Phenoscape uses the Entity-Quality (EQ) model to transform characters and character states into formal phenotype assertions. Data transformed in this way from the systematics literature are integrated with mutant phenotype data from model organisms in a large knowledge base in order to generate hypotheses about the genetic causes of evolutionary character transitions. I will discuss both successes and challenges in blending formal knowledge representation methods with descriptive biology and hypotheses of descent. An important remaining challenge is a logic framework for reasoning over homology, i.e., descent from a common ancestor, which is required for many forms of evolutionary inference.


Hilmar Lapp is the Assistant Director for Informatics at the National Evolutionary Synthesis Center (NESCent). His research interests are in reusable and interoperable software and data, large-scale data integration, and building sustainable cyberinfrastructure. A biologist by training, he has also been programming for more than two decades, with work ranging from commercial applications to real-time data acquisition to bioinformatics data integration and standards. In his role at NESCent, he is involved in many of the Center’s cyberinfrastructure initiatives, and serves as senior personnel in the NSF-funded Phenoscape project as well as the Dryad digital repository for data supporting scientific publications. Before joining NESCent in 2006, he worked for almost 10 years in functional genome informatics in the biopharmaceutical sector. At the Genomics Institute of the Novartis Research Foundation (GNF) in San Diego, CA, he built SymAtlas, one of the first decidedly gene-centric databases integrating genome annotation databases with gene function data.