The ODIE Toolkit - Software for Information Extraction and Biomedical Ontology Development

The ODIE Toolkit - Software for Information Extraction and Biomedical Ontology Development (2007 - 2011)

We propose a program of research with two interlocking, foundational goals: (1) to develop and evaluate software for information extraction from clinical text corpora using existing Open Biomedical Ontologies (OBO) and (2) to develop and evaluate software for enrichment of existing biomedical ontologies from clinical text corpora. As a result of our work we will deliver the Ontology Development and Information Extraction Toolkit (ODIE), a set of software components integrated with GATE, Protégé, and LexGrid that will assist researchers and ontology developers in performing these tasks. As a testbed for our work, we will focus mainly on the National Cancer Institute Thesaurus, an existing OBO ontology, but will develop many of our components to be generalizable to other OBO ontologies. We have chosen the domain of hematopathology as a test case because it offers a rich and varied source of clinical documents and because of the potential for our software to advance translational biomedical research in this area. However, the majority of the components that we develop will be domain-neutral and will generalize to other areas within and outside of oncology. The work we propose is significant for three contributions. First, we will develop novel methods or modify existing methods for accomplishing information extraction and ontology enrichment, and we will evaluate the performance of these alternatives. Second, we will develop and disseminate generic software resources for performing these tasks, which leverage tools supported by the National Center for Biomedical Ontology. Third, we will contribute to the development of existing OBO ontologies. The results of this work will use OBO ontologies in fundamental ways to advance biomedicine.

This grant proposes to develop a set of computer tools to assist researchers in (1) extracting meaning from and codifying medical documents, and (2) building formal representations of knowledge from those documents. This work would benefit the general public by increasing the speed and efficiency of determining what information is in a particular medical document and allowing automated processing of large numbers of documents. Additionally, the project would contribute to the software for developing other applications by helping researchers build more comprehensive ontologies. The results of this work may benefit both medical research and patient care.

Selected Publications, Papers, and Presentations

  • Zheng J, Chapman WW, Crowley RS, Savova GK. Coreference resolution: a review of general methodologies and applications in the clinical domain. J Biomed Inform. 2011 Dec;44(6):1113-22. doi: 10.1016/j.jbi.2011.08.006. Epub 2011 Aug 12.
  • Zheng J, Chapman WW, Miller TA, Lin C, Crowley RS, Savova GK. A system for coreference resolution for the clinical narrative. J Am Med Inform Assoc. 2012 Jul-Aug;19(4):660-7. doi: 10.1136/amiajnl-2011-000599. Epub 2012 Jan 31.
  • Chapman WW, Savova GK, Zheng J, Tharp M, Crowley RS. Anaphoric reference in clinical reports: characteristics of an annotated corpus. J Biomed Inform. 2012 Jun;45(3):507-21. doi: 10.1016/j.jbi.2012.01.010. Epub 2012 Feb 9.

PI: Rebecca Crowley (University of Pittsburgh)
Co-I: Wendy Chapman