MIAPA: Developing a Minimal Reporting Standard for Phylogenetics
Arlin Stoltzfus (National Institute of Standards and Technology)
Overview and purposes
Domain scientists with an interest in the archiving and re-use of phylogenetic data have called for (but not yet developed) a reporting standard designated "Minimum Information About a Phylogenetic Analysis", or MIAPA (Leebens-Mack, et al. 2006). The vision of these scientists is that the research community would develop, and adhere to, a standard that imposes a minimal reporting burden yet ensures that the reported data can be interpreted and re-used. Such a standard might be adopted by:
- journals that publish supplementary material for phylogenetic studies (e.g., MBE, Systematic Biology)
- repositories and databases designed to archive published data (e.g., TreeBASE, Dryad)
- pipeline projects that generate phylogenetic data sets for re-use (e.g., TreeFam, Pandit, Hovergen, PhylomeDB)
- granting organizations that support phylogenetic studies (e.g., NSF)
- organizations that develop taxonomic nomenclature based on phylogenies
This White Paper sketches a pathway for the development of a MIAPA standard, and addresses how NESCent could assist in this process. Clearly the above vision of MIAPA aligns with NESCent's programmatic interest in the infrastructure to support evolutionary research. Furthermore, the involvement of NESCent in its specific role as a synthesis center is highly appropriate in that successful standardization requires not just technology development, but the cultivation of a community dynamic involving various types of end-users, software developers, and service-providers. NESCent involvement could aid in rapid and effective promulgation of a MIAPA standard, while generating useful synergies with existing NESCent projects (e.g., Dryad) and working groups (e.g., Evolutionary Informatics).
Facilitating Data Re-use and Study Replication via Community Standards
A simple case of data re-use would be that of one scientist citing a numeric result extracted from a published paper, e.g., in order to justify an assumption. We can think of data re-use as any transaction in which a secondary user makes use of data published by a primary user (author). Study replication is a particularly demanding kind of data re-use, in which the secondary user needs an exact recipe for how the data were produced and analyzed.
Why facilitate data re-use and replication?
Science is progressive and self-policing. New studies build on old results, thus re-use of data is crucial to the progressive aspect of science. The self-policing aspect of science depends on the potential that new studies built on a faulty foundation will fail, and more acutely, that attempts to repeat (replicate) a faulty study will fail, casting doubt on it.
For these reasons, professional associations, publishers, and funding agencies have recognized that availability of the data underlying published scientific findings is essential to a healthy scientific process (see the Data Sharing article on Wikipedia for references to data sharing policies in the US). Authors of scientific studies often are required (as a condition of funding or of publication) to make such data available to the research community without restriction.
How data standards may facilitate data re-use
Policies that require "data sharing" are a step in the right direction, but they do not ensure that re-use of data is feasible or effective. Scientists who have tried to re-use data obtained from others quickly learn that mere availability does not ensure re-usability.
To conceptualize barriers to re-use, let's imagine a data re-use system that is based solely on lab notebooks and that is administered by scientists themselves, who simply agree to photocopy pages and mail them out on request. Such a system presents many barriers. First, it's difficult to query. Even if we could search all notebooks for items of interest, e.g., weight and age of individuals of the species "Drosophila melanogaster", we have no guarantee that the desired lab notebooks would contain the key terms: most "Drosophila melanogaster" researchers simply use the term "flies" among themselves. In addition, the retrieval system is flawed. Even if our query identifies 10 "hits" (lab notebooks containing data on weights of flies), we might never learn of them, because the scientists might forget to mail out the pages. Even if we get the notebook pages, they might be in an unfamiliar language, with no explicit indication of which language it is. Photocopying might have introduced a blank spot over some data, and there would be no way to detect this (to check data integrity). The photocopied page would not reveal the author or notebook source, so the data would be hard to validate or trace.
A data standard is a way to address such problems, by specifying rules for creating and handling records in such a way that their value is clear and maintainable. For instance, we might insist that every compliant biological record include the species source using a standard taxonomic nomenclature, and that every compliant database must have the capacity to retrieve records in response to a query on the species name.
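The two example rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`species`, `data`), the small `ACCEPTED_NAMES` set standing in for a standard taxonomic nomenclature, the `SpeciesIndex` class, and the sample values are all invented here and are not part of any existing standard.

```python
from collections import defaultdict

# Hypothetical stand-in for a standard taxonomic nomenclature.
ACCEPTED_NAMES = {"Drosophila melanogaster", "Homo sapiens"}


def is_compliant(record):
    """A record complies if it names its species source with an accepted name."""
    return record.get("species") in ACCEPTED_NAMES


class SpeciesIndex:
    """Toy database that retrieves compliant records by species name."""

    def __init__(self):
        self._by_species = defaultdict(list)

    def add(self, record):
        # Reject non-compliant records instead of storing them silently.
        if not is_compliant(record):
            raise ValueError("record lacks an accepted species name")
        self._by_species[record["species"]].append(record)

    def query(self, species):
        # Return a copy so callers cannot alter the stored list.
        return list(self._by_species.get(species, []))


index = SpeciesIndex()
# Illustrative values only, not real measurements.
index.add({"species": "Drosophila melanogaster", "data": {"weight_mg": 1.0, "age_days": 5}})
hits = index.query("Drosophila melanogaster")

# The informal lab term from the notebook scenario fails the rule.
informal = {"species": "flies", "data": {"weight_mg": 1.1}}
```

Under this rule the informal record would be refused at deposit time, which is exactly the point at which the author is still available to supply the missing species name.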
Use of a standard requires compliant technology implementations
We can break down the use of a reporting standard into separate steps.
- Generation: generation of compliant report or record by end-user
- Deposition: transfer and incorporation of compliant report into repository
- Databasing: provision of a service to store, query and retrieve records
- Re-use: acquisition and analysis of a record by another end-user
- Maintenance: revision and possibly replacement of the record, and of the standard
Clearly the success of these operations requires compliant technologies: the data standard itself is not enough. Users must have ways to create compliant reports, and there must be databases that will accept compliant reports and incorporate them in a form that allows query and retrieval.
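To make the steps listed above concrete, here is a minimal Python sketch of a repository that supports deposition, databasing, re-use, and maintenance. Everything in it is hypothetical: the `ToyRepository` class, its method names, and the two required fields (`author`, `content`) are assumptions chosen for illustration, not a proposed design.

```python
import copy


class ToyRepository:
    """In-memory stand-in for a repository of reports (hypothetical)."""

    REQUIRED = ("author", "content")  # hypothetical compliance rule

    def __init__(self):
        self._versions = {}  # record id -> list of report versions
        self._next_id = 1

    def deposit(self, report):
        """Deposition: accept a report only if it passes the compliance check."""
        missing = [f for f in self.REQUIRED if not report.get(f)]
        if missing:
            raise ValueError(f"non-compliant report, missing: {missing}")
        record_id = self._next_id
        self._next_id += 1
        self._versions[record_id] = [copy.deepcopy(report)]
        return record_id

    def query(self, author):
        """Databasing: find ids of records whose latest version matches."""
        return [rid for rid, vs in self._versions.items() if vs[-1]["author"] == author]

    def retrieve(self, record_id, version=-1):
        """Re-use: hand a copy of a stored version to a secondary user."""
        return copy.deepcopy(self._versions[record_id][version])

    def revise(self, record_id, **changes):
        """Maintenance: append a revised version, keeping the history."""
        revised = {**self._versions[record_id][-1], **changes}
        self._versions[record_id].append(revised)
        return len(self._versions[record_id])


# Generation: the end-user composes a report.
report = {"author": "A. Researcher", "content": "tree and alignment"}
repo = ToyRepository()
rid = repo.deposit(report)
found = repo.query("A. Researcher")
reused = repo.retrieve(rid)
n_versions = repo.revise(rid, content="corrected tree and alignment")
```

Keeping every version, rather than overwriting, is one possible choice for the maintenance step: a secondary user who cited an earlier version can still retrieve exactly what was cited.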
Successful promulgation of a standard is a community project
Finally, the existence of a good data standard, even with good technology support, does not promote interoperability or data re-use without "buy-in" from the community of stake-holders (all those who will be affected by the standard). The need for "buy-in" starts early in the process of developing a standard.
Thus the success of a standard is a community dynamic involving 1) end-users and service providers that agree to use the standard, 2) software developers that provide tools to facilitate these uses, and 3) effective mechanisms for disseminating, maintaining, and upgrading the standard. Successful implementation of a standard typically requires a broad consortium that attends to all aspects of this community dynamic.
Developing a MIAPA strategy
Biology is a rapidly changing field, with new types of data appearing constantly, as well as new methods for generating and analyzing old types of data. Biologists want to support data re-use but are wary of heavyweight standards that may be difficult to adapt, narrow in scope, and burdensome for end-users. This has led to the promulgation of various "minimal" reporting standards, the best known of which is MIAME, or Minimum Information About a Microarray Experiment. The focus of "minimization" is the effort of end-users to comply with the standard.
The MIAPA proposal by Leebens-Mack, et al. is based on this concept of a minimal standard. Currently, MIAPA is hypothetical and aspirational, i.e., no standard has been developed. As a starting point, Leebens-Mack, et al. suggest that a study should report objectives, sequences, taxa, alignment method, alignment, phylogeny inference method, and phylogeny. This implies that MIAPA is intended only for studies of molecular sequence data, though phylogeny inference may be done with other types of data.
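The seven items suggested by Leebens-Mack, et al. can be read as a checklist. The Python sketch below shows one way such a checklist might be tested against a draft report. The key names in `MIAPA_ITEMS`, the `missing_items` function, and the placeholder values in `draft` are assumptions made for illustration, since no MIAPA standard defines them.

```python
# Checklist items from the starting point suggested by Leebens-Mack, et al.
# The key names themselves are hypothetical.
MIAPA_ITEMS = (
    "objectives",
    "sequences",
    "taxa",
    "alignment_method",
    "alignment",
    "inference_method",
    "phylogeny",
)


def missing_items(report):
    """Return the checklist items that are absent or empty in a report (a dict)."""
    return [item for item in MIAPA_ITEMS if not report.get(item)]


# A draft report with placeholder values; two items have not been supplied.
draft = {
    "objectives": "resolve relationships among three taxa",
    "sequences": ["placeholder accession 1", "placeholder accession 2"],
    "taxa": ["A", "B", "C"],
    "alignment_method": "placeholder: program, version, settings",
    "phylogeny": "((A,B),C);",
}

print(missing_items(draft))  # ['alignment', 'inference_method']
```

A check of this kind only tests that each item is present. Whether the content of an item is adequate for re-use is a separate question that a real standard would need to address.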
As explained above, a standard does not succeed on its own, but requires technology infrastructure that supports the standard, as well as a community dynamic of support. A MIAPA strategy therefore is not just a strategy for designing a standard, but a strategy for designing, supporting, and promulgating the standard. A number of issues will come into play:
- Knowledge collection (prior to designing standard): How will MIAPA developers learn and record background knowledge for developing the standard?
  - systematic analysis of literature (e.g., search on "phylogeny" articles)
  - panels of experts
  - generation of a set of annotated examples
- Scope: What is the scope and complexity of "phylogenetic analysis" reports to which MIAPA will apply?
  - is the scope restricted to reports deriving a phylogeny from observed data, as opposed to
    - studies that use a pre-derived phylogeny for reconstruction or dating?
    - studies that derive phylogenies from simulated (not observed) data?
  - does it apply to Bayesian studies that integrate over a phylogenetic distribution to compute some other statistic (e.g., a gain-loss ratio DOI:10.1126/science.288.5475.2349) and do not report token trees or modal trees to represent this distribution?
  - is the scope restricted to studies involving molecular sequence data (as the MIAPA paper implies)?
  - does it apply to population studies for sexual organisms?
  - does it apply to tree-like constructs that can be construed, or are construed by the originators, as something other than paths of descent-with-modification (e.g., phenograms, cladograms)?
- Reusability criteria: What qualities of a MIAPA record make its data reusable (i.e., make it easier to re-purpose data and build on results)?
  - data accessible in standard formats
  - capacity for validation
  - provenance, ideally provenance that can be traced automatically via external references
  - description of methods sufficient to reproduce results from data
  - rich semantics in any descriptions or annotations
- Infrastructure for promulgating standard: What technology and artefacts are needed to successfully promulgate the standard?
  - a data exchange format (e.g., an XML-based format) for MIAPA documents
  - a repository to store MIAPA-compliant entries
  - a controlled vocabulary for meta-data (e.g., Concept Glossary or ontology)
  - software tools and libraries to support MIAPA-compliant annotations
  - interactive software to facilitate creation of MIAPA-compliant documents
  - a relational mapping of the MIAPA standard to be used in repositories
  - a service to validate MIAPA compliance
- Minimizing compliance burden: What qualities of a standard make it "minimal"? How can technology ease the burden of compliance?
  - fewer categories of metadata
  - fewer arbitrary restrictions on format
  - familiarity of metadata concepts
  - flexibility in representation
  - software support for annotation
- Compliance policy: Given the standard, what does it mean to comply? How does a user know if a report or a service is MIAPA compliant?
  - separate compliance policies for different aspects of the standard
    - a standard for what is a compliant data exchange format
    - a standard for what is a compliant repository
    - a standard for what is a compliant publisher
  - a way to assess and monitor compliance
- Community involvement: What community-based mechanisms or resources are needed to facilitate success?
  - a MIAPA consortium with representatives from data resources, publishers, researchers, and programmers
  - a research infrastructure for gathering and recording background information used in design
  - a community evaluation strategy, including
    - a means to coordinate user testing (e.g., in classes and at scientific conferences)
  - collaboration with ontology experts, including those at NCBO
  - collaborative projects (including hackathons) to develop compliant technologies
  - workshops to train users and service-providers in compliance strategies
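Two of the infrastructure items listed above, an XML-based data exchange format and a validation service, can be sketched together using Python's standard `xml.etree.ElementTree` module. No MIAPA schema exists, so the root element `miapa_report`, the child element names, and the `REQUIRED_ELEMENTS` subset are all invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of required elements; no MIAPA schema defines these.
REQUIRED_ELEMENTS = ("objectives", "taxa", "alignment", "phylogeny")


def to_xml(report):
    """Serialize a flat dict of strings as a <miapa_report> document."""
    root = ET.Element("miapa_report")
    for name, text in report.items():
        ET.SubElement(root, name).text = text
    return ET.tostring(root, encoding="unicode")


def validate(xml_text):
    """Return the names of required elements that are missing or empty."""
    root = ET.fromstring(xml_text)
    if root.tag != "miapa_report":
        return ["miapa_report"]
    problems = []
    for name in REQUIRED_ELEMENTS:
        text = root.findtext(name)
        if text is None or not text.strip():
            problems.append(name)
    return problems


doc = to_xml({
    "objectives": "test monophyly of a clade",
    "taxa": "A B C",
    "phylogeny": "((A,B),C);",
})
problems = validate(doc)  # the alignment element was never supplied
```

A real exchange format would more likely be defined by a schema and validated against it. The hand-written check here only illustrates the idea that compliance can be tested mechanically.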
A Role for NESCent
Successful development and implementation of a MIAPA standard will take time and will involve work at a variety of levels. NESCent could play a facilitating or organizing role in various aspects of this process, as outlined below.
Forming a MIAPA Consortium and providing IT support
Development and implementation of MIAPA should be guided by a consortium representing the interests of scientific end-users, scientific service-providers (including publishers), software developers, and possibly funding agencies. NESCent's high visibility in the evolution research community can be leveraged to assemble a MIAPA consortium. NESCent also could provide information technology services such as mailing lists and web sites to support the MIAPA consortium in its work.
This stage does not necessarily require support for a physical meeting of the entire consortium. After the consortium is formed, meetings of small groups will be required (e.g., standards design team). If a large meeting is needed, the consortium could piggy-back on a major international conference.
Supporting a MIAPA Data Standard Design Team
Perhaps the first goal for the MIAPA consortium would be to assemble a design team and charge it with developing an initial MIAPA standard for evaluation and further refinement. This team would need to work closely with (and might overlap in membership with) a core software team developing tools to support the standard, and an evaluation team to assess the design.
The initial design might have a narrow scope, e.g., molecular sequence phylogenetics; however, the design team should begin with a broad and clear appreciation for the needs of different stakeholders:
- researchers submitting reports;
- publishers requiring reports;
- providers of repository services (receipt, storage, query, retrieval of reports);
- providers of other compatible software (composing, editing, viewing, validating reports);
- researchers re-using data in order to advance scientific knowledge.
NESCent could support such a design team as a working group.
Organizing Knowledge Acquisition for MIAPA design and evaluation
Mechanisms to gather expert knowledge from researchers are needed in two separate phases of the development process: initial design requirements, and evaluation. For both purposes, it is necessary to get members of the target population (the users and service-providers who will be asked to comply with MIAPA) to convey their knowledge in some systematic way.
At the start, the MIAPA design team may wish to gather knowledge in order to assess the breadth of phylogenetic studies and the types of information that must be represented to make results re-usable. This kind of knowledge might be obtained:
- directly from within a design team that includes experts with relevant experience
- from a search and analysis of the biomedical literature of phylogenetic studies
- from a knowledge-capture process such as a survey
A more exciting version of knowledge capture, suggested on the Supporting_MIAPA page of the Evolutionary Informatics working group, would be to stage a knowledge capture event at a conference. Good, et al. showed that, in the context of a scientific conference, it only takes the promise of recognition and of modest prizes (t-shirts and coffee mugs) to stimulate large numbers of expert users to pass on their knowledge via a web form.
Once a standard is developed, MIAPA designers will need to evaluate it for its ability to represent actual phylogenetic analyses, and to support data re-use. It's important to understand that the purpose of the standard is not simply to represent data, but to make it re-usable. In particular, the evaluation should target users who wish to carry out meta-analyses of data in repositories.
MIAPA Software Infrastructure Working Group
This group would focus on the infrastructure needed for successful promulgation of the standard, such as:
- software for composing MIAPA-compliant annotations from user input
- file parsing libraries in common bioinformatics languages (Perl, Java)
- software for semantic transformation between different MIAPA-compliant formats
- web service interfaces for composing, validating, translating MIAPA-compliant files
The group may also wish to develop reference implementations, i.e., software implementations that provide a reference standard against which other implementations may be judged.
Depending on the organization of software teams, this group might be in a good position to get external funding for software development.
Certainly this group should take advantage of mechanisms such as Google Summer of Code for securing the talents of young programmers looking for training projects. Many students in computer science programs are looking for coherent, useful, short-term projects on which to hone their skills and to build up a portfolio of accomplishments.
MIAPA Compliance Hackathons and workshops
If the MIAPA project is on a successful trajectory, with a good design as well as infrastructure support, this will create a demand for workshops to train users, as well as hackathons for software developers to improve compliance. Training topics might include:
- for research end-users:
  - how to generate a MIAPA-compliant report
  - how to submit a report to a repository
- for repository service providers:
  - how to ensure that a database is a MIAPA-compliant repository
Early in the process, NESCent could host smaller workshops. If MIAPA is to be widely adopted, there will be a need for large-scale workshops or training sessions that are more appropriate for the venue of a scientific conference.
Evaluating potential and actual benefits for NESCent participation
NESCent has limited resources and must decide wisely on which projects to support. Presumably NESCent chooses to support projects based on the extent to which several criteria apply:
- programmatic alignment: the project is an integrative project or infrastructure activity in the evolution research community
- importance: the results of the project will benefit the evolution research community
- unique NESCent role: the project is unlikely to be funded by traditional mechanisms, or it leverages unique NESCent resources
With regard to the first and third criteria, it's clear that a MIAPA project is aligned programmatically with NESCent's goals, and that it leverages a distinctive combination of NESCent resources.
With regard to the second criterion, importance, the picture is less clear. Supporting the MIAPA project is important to the extent that it strengthens the progressive and self-policing nature of science by facilitating data re-use.
What is the potential size of this effect, and how can it be measured? According to Kumar, et al., thousands of phylogenetics papers are published each year. If users had access to a MIAPA standard with software making it easy to create compliant reports, and if there were a repository accepting such reports, then publishers (especially in the case of journals published by professional societies) might ask for compliance as a condition of publication. There would then be thousands of MIAPA reports each year, and the number of reports would be easy to count.
However, the key question is whether a compliant report facilitates data re-use. This effect might be measured by tracking citations. When other factors are controlled, published papers that are accompanied by a MIAPA-compliant report should be cited more often. It also may be possible to institute some mechanism for tracking citations to repository records. However, phylogenetics is not a particularly rapidly moving field. Published papers develop a citation record that reflects their quality and importance, but in a slow-moving field it may take several years to establish this record.