Informatics Initiatives: Cyberinfrastructure for Evolutionary Biology

NESCent is charged with being at the forefront of the development of scientific cyberinfrastructure, with a focus on the software and human resources needed to enable researchers to address grand synthetic questions in evolutionary biology. To date, NESCent has undertaken the following major initiatives to address some of the most critical needs in the field.

Reusable database components for genomic and natural diversity data

The evolutionary biology community faces a data-management challenge due to the overwhelming volume of high-throughput genomic data being generated across a wide variety of organisms. To address this, NESCent has partnered with the Generic Model Organism Database (GMOD) project, whose open-source software suite provides a rich and flexible data model for model organisms, a host of intercompatible web and user interface components, and applications for curating genomic data (such as sequences, genetic maps, and published literature). NESCent aims to extend the data model to evolutionary data types (named organisms, georeferenced collections, genetic and phenotypic variability, and phylogenies), to develop web applications for accessing these types of data, and to provide user support for adoption of the GMOD platform for evolutionary model organisms.

More information: GMOD helpdesk at NESCent, Heliconius database pilot project

Application of ontologies to phenotypic studies

An ontology is a structured and controlled vocabulary for formalizing knowledge within a particular domain, and is an important technology for computational processing of semantic concepts. NESCent is promoting the development and application of phenotype ontologies to evolutionary morphology. The aim is to enable computers to reason about phenotypic descriptions of organisms, and to relate descriptions of phenotypes in natural systems to developmental genetic information about mutant phenotypes in model organisms.
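
As a rough illustration of what reasoning over a controlled vocabulary can mean in practice, the minimal Python sketch below follows "is_a" links to find every annotation that concerns a term or any of its subtypes. The terms, taxa, and function names are hypothetical and chosen only for the example; they are not taken from any particular phenotype ontology or NESCent tool.

```python
# Hypothetical toy vocabulary: each term maps to its direct "is_a" parents.
IS_A = {
    "dorsal fin": ["fin"],
    "pectoral fin": ["fin"],
    "fin": ["appendage"],
    "appendage": ["anatomical structure"],
    "anatomical structure": [],
}

# Hypothetical phenotype annotations: (subject, anatomical term, quality).
ANNOTATIONS = [
    ("species A", "dorsal fin", "reduced"),
    ("mutant line B", "pectoral fin", "absent"),
    ("species C", "anatomical structure", "enlarged"),
]


def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def annotations_matching(query, annotations=ANNOTATIONS, is_a=IS_A):
    """Select annotations whose term is the query term or one of its subtypes."""
    return [
        a for a in annotations
        if a[1] == query or query in ancestors(a[1], is_a)
    ]


if __name__ == "__main__":
    # A query for "fin" also retrieves the dorsal and pectoral fin records.
    for subject, term, quality in annotations_matching("fin"):
        print(subject, term, quality)
```

Because the subtype relationship is stored once in the vocabulary, a query phrased at a general level can bring together a wild-species description and a mutant description that were annotated with different, more specific terms.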

Integration of geographic data into phylogenetics

Evolution is a spatiotemporal process structured by the geographical history of the Earth. Fully understanding how species evolve through dispersal, isolation, and changing environmental and ecological conditions requires integrating the large body of georeferenced information on past and present organismal distributions, environmental conditions, and populations with phylogenetic relationships and evolutionary analyses. A geographical information system (GIS) is software that supports the storage, retrieval, mapping, and analysis of geographic data. Within evolutionary science, GIS can be used to explicitly integrate population genetic and phylogenetic models of multiple taxa with data describing present and past environments and climate. NESCent is supporting the work of postdoctoral fellow David Kidd to build phylogenetic extensions to ArcGIS, the most commonly used GIS platform. A software tool, GeoPhyloBuilder, has been developed to build a spatiotemporal phylogeographic GIS model from a tree and a set of geographical features.
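
To make the idea of combining a tree with geographical features concrete, here is a minimal Python sketch. It is not GeoPhyloBuilder's method or interface: the `Node` class, the taxon names, and the rule that places each internal node at the plain average of its children's coordinates are all assumptions made for illustration, and the averaging ignores spherical geometry and the date line.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    age: float  # time before present; 0 for extant tips
    location: Optional[Tuple[float, float]] = None  # (longitude, latitude)
    children: List["Node"] = field(default_factory=list)


def place(node):
    """Give each internal node the mean location of its children (post-order).

    Tips are assumed to already carry a location.
    """
    if node.children:
        points = [place(child) for child in node.children]
        node.location = (
            sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points),
        )
    return node.location


def branches(node):
    """Yield one (start, end) segment per branch as (lon, lat, age) points."""
    for child in node.children:
        yield ((*node.location, node.age), (*child.location, child.age))
        yield from branches(child)


# Hypothetical three-taxon tree with made-up coordinates and ages.
tree = Node("root", 2.0, children=[
    Node("A", 0.0, (0.0, 0.0)),
    Node("BC", 1.0, children=[
        Node("B", 0.0, (10.0, 0.0)),
        Node("C", 0.0, (10.0, 20.0)),
    ]),
])
place(tree)
segments = list(branches(tree))

if __name__ == "__main__":
    for start, end in segments:
        print(start, "->", end)
```

Each segment carries age as a third coordinate, so the branches could be drawn as lines in space and time alongside environmental layers for the corresponding periods.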

Libraries for combining evolutionary analysis tools and data into automated workflows

Many powerful and sophisticated model-based analysis tools exist for comparing evolutionary models, inferring phylogenetic trees, and testing hypotheses in a comparative framework, but very few of these tools interoperate with each other. Incompatible interfaces and data exchange formats make it difficult to incorporate these tools seamlessly into automated workflows that address advanced questions on large data sets in an evolutionary context. NESCent periodically brings together open source software developers and evolutionary biologists for intensive collaborative coding sessions known as "hackathons", which are used to help overcome barriers to intercompatibility.

NESCent's first Phyloinformatics hackathon in December 2006 jumpstarted the development of common object models and phyloinformatics "glue code" in the widely used Bio* toolkits. It also led to the first steps in validating and standardizing the use of the NEXUS format for data exchange, and to the first definition of a format for exchanging models between analysis programs.
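
As a small illustration of what validating a NEXUS file can involve, the Python sketch below checks that the number of taxa declared in a TAXA block matches the number of labels actually listed. It is a deliberately simplified assumption-laden example, not code from the hackathon or from any Bio* toolkit: the function names are hypothetical, and it does not handle quoted labels, bracketed comments, or other parts of the format.

```python
import re


def read_taxa(nexus_text):
    """Return (declared_ntax, labels) from the first TAXA block.

    Simplified: labels are assumed to be whitespace-separated bare words.
    """
    if not nexus_text.lstrip().upper().startswith("#NEXUS"):
        raise ValueError("missing #NEXUS header")
    block = re.search(r"BEGIN\s+TAXA\s*;(.*?)END\s*;", nexus_text, re.I | re.S)
    if block is None:
        raise ValueError("no TAXA block found")
    body = block.group(1)
    ntax = re.search(r"NTAX\s*=\s*(\d+)", body, re.I)
    labels = re.search(r"TAXLABELS\s+(.*?);", body, re.I | re.S)
    declared = int(ntax.group(1)) if ntax else None
    names = labels.group(1).split() if labels else []
    return declared, names


def is_consistent(nexus_text):
    """True when the declared taxon count equals the number of labels."""
    declared, names = read_taxa(nexus_text)
    return declared == len(names)


EXAMPLE = """#NEXUS
BEGIN TAXA;
    DIMENSIONS NTAX=3;
    TAXLABELS Homo Pan Gorilla;
END;
"""

if __name__ == "__main__":
    print(read_taxa(EXAMPLE))
    print(is_consistent(EXAMPLE))
```

Checks of this kind matter for workflows because a file that one program writes loosely may be rejected, or silently misread, by the next program in the chain.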

Repository for preservation, discovery, and synthesis of published digital data in the biosciences

There is a vast and growing body of highly diverse data resulting from more than a century of published research in the biosciences and in biomedicine that is not currently being digitally archived in any form. In partnership with the MetaData Research Center at UNC, NESCent is working to establish a digital repository for published data in these fields. The goal is to provide for long-term data preservation, to allow efficient discovery of archived data, to share data in a way that respects the intellectual property rights of researchers and journals, and to assist the researcher in obtaining rich and high-quality metadata.

More information: Dryad Data Repository