Resources

ConText Annotations

The dataset consists of de-identified, randomized sentences from 120 clinical reports from the University of Pittsburgh Medical Center, including discharge summaries and emergency department, echocardiogram, surgical pathology, operative, and radiology reports. Each problem mention is annotated for negation, temporality, and experiencer. These annotations can be used to evaluate the output of the NegEx and ConText algorithms.
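As a rough illustration of such an evaluation, the sketch below scores predicted negation labels against gold annotations. The function name `negation_scores` and the label strings "Negated" and "Affirmed" are assumptions made for this example; the actual label values and file layout of the dataset should be checked before use.

```python
from typing import Dict, Sequence


def negation_scores(
    gold: Sequence[str],
    predicted: Sequence[str],
    positive: str = "Negated",  # assumed label string, not confirmed by the dataset
) -> Dict[str, float]:
    """Precision, recall, and F1 for one label, given aligned label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}
```

The same function can be reused for temporality or experiencer by passing a different `positive` label, provided the gold and predicted lists are aligned mention by mention.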

MTSamples

The dataset consists of over 2,500 de-identified clinical reports from the MTSamples.com website. The notes contain surrogates that replace protected health information (PHI) elements, along with their character offsets, for researchers who would like to develop and test de-identification algorithms. The MTSamplesCollaborativeCommunity is a private community in the MIDAS repository on the iDASH infrastructure. To access the dataset, please follow the instructions on the iDASH website for setting up a MIDAS account using the link below.
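To show how character offsets are typically used when testing a de-identification algorithm, the sketch below pulls the annotated strings out of a note. It assumes 0-based, end-exclusive offsets and a `(start, end, label)` tuple layout; both are assumptions for illustration, and the helper name `extract_spans` is hypothetical. The dataset's own documentation is the authority on its offset convention.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label); layout assumed for this sketch


def extract_spans(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Return (label, surface string) pairs for each annotated span.

    Offsets are treated as 0-based and end-exclusive (an assumption).
    """
    results = []
    for start, end, label in spans:
        # Reject offsets that fall outside the note or describe an empty span.
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span ({start}, {end}) is out of range")
        results.append((label, text[start:end]))
    return results
```

Printing the extracted strings next to their labels is a quick sanity check that the offset convention has been read correctly before any scoring is attempted.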

ShARe/CLEF eHealth 2013 Challenge - Tasks 1 and 2

The dataset for Tasks 1 and 2 consists of de-identified clinical free-text notes authored in the ICU setting, including discharge summaries, ECG reports, echocardiogram reports, and radiology reports from the MIMIC II database, version 2.5.
The dataset consists of 200 training notes and 100 test notes. The Task 1 dataset contains disease/disorder mention annotations generated by two medical coders; the Task 2 dataset contains acronym/abbreviation mention annotations generated by nursing professionals, NLP researchers, and biomedical informaticians. To access the dataset, please follow the instructions on the ShARe website for setting up a PhysioNet account using the link below.