README:: De-identified Data Set, Symbols 
Version 1 (October 30, 2012)
Contact information: 
	Natural Language Processing/Information Extraction (NLP/IE) Program 
	Email: nlp-ie@umn.edu

---------------------------------------
Four common non-alphanumeric symbols ('+', '-', '/', and '#') were selected from among the most frequently occurring symbols in clinical text. For each symbol, 1000 instances were de-identified and their senses (meanings) were manually annotated. The annotated data are available to researchers.

Reference for the original study:
Automated non-alphanumeric symbol resolution in clinical texts. 
Moon S, Pakhomov S, Ryan J, Melton GB. 
AMIA Annu Symp Proc. 2011;2011:979-86. Epub 2011 Oct 22. 
PubMed PMID: 22195157; PubMed Central PMCID: PMC3243158.

Samples were de-identified using the safe harbor method. In the data, identification codes have the basic format _%#IDENTIFIER#%_. Identifiers were replaced with the following identification codes:

Identifiers 1: Name
Identification codes: _%#NAME#%_

Identifiers 2: Street address (Geographic subdivisions)
Identification codes: _%#STREET#%_

Identifiers 3: City (Geographic subdivisions)
Identification codes: _%#CITY#%_

Identifiers 4: County (Geographic subdivisions)
Identification codes: _%#COUNTY#%_

Identifiers 5: Precinct (Geographic subdivisions)
Identification codes: _%#PRECINCT#%_

Identifiers 6: All geographic subdivisions smaller than a State (Geographic subdivisions)
Identification codes: _%#ADDRESS#%_

Identifiers 7: Zip code (Geographic subdivisions), where the geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people
Identification codes: 55455 => _%#55400#%_

Identifiers 8: Zip code (Geographic subdivisions), for all such geographic units containing 20,000 or fewer people
Identification codes: _%#00000#%_

Identifiers 9: Dates for under 89 (keep real year), All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death
Identification codes: 02/07/2000 => _%#DDMM2000#%_ 
Identification codes: Feb, 07, 2000 => _%#MM#%_, _%#DD#%_, _%#2000#%_
Identification codes: 10/17 => _%#MMDD#%_

Identifiers 10: Dates for over 89
Identification codes: _%#DDMM1914#%_
Identification codes: Feb, 07, 1997 => _%#MM#%_, _%#DD#%_, _%#1914#%_

Identifiers 11: Telephone numbers
Identification codes: _%#TEL#%_

Identifiers 12: Fax numbers
Identification codes: _%#FAX#%_

Identifiers 13: Electronic mail addresses
Identification codes: _%#EMAIL#%_

Identifiers 14: Social security numbers
Identification codes: _%#SSN#%_

Identifiers 15: Medical record numbers
Identification codes: _%#MRN#%_

Identifiers 16: Health plan beneficiary numbers
Identification codes: _%#HPBN#%_

Identifiers 17: Account numbers
Identification codes: _%#ACCOUNTN#%_

Identifiers 18: Certificate/license numbers
Identification codes: _%#LN#%_

Identifiers 19: Vehicle identifiers and serial numbers, including license plate numbers
Identification codes: _%#VN#%_

Identifiers 20: Device
Identification codes: _%#DEVICE#%_
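
Example (not part of the original release): a minimal Python sketch for locating identification codes in a sample. It assumes that every code uses the _%#...#%_ wrapper shown above and that the label inside consists only of uppercase letters and digits. The function name find_codes and the sample sentence are made up for illustration.

```python
import re

# Assumption: every identification code is wrapped as _%#LABEL#%_ and the
# label contains only uppercase letters and digits (e.g. NAME, DDMM2000, 55400).
CODE_RE = re.compile(r"_%#([A-Z0-9]+)#%_")


def find_codes(text):
    """Return a list of (label, start, end) for each identification code in text."""
    return [(m.group(1), m.start(), m.end()) for m in CODE_RE.finditer(text)]


# Hypothetical sentence, not taken from the data set.
sample = "Seen by _%#NAME#%_ on _%#DDMM2000#%_ in _%#CITY#%_."
labels = [label for label, _, _ in find_codes(sample)]
print(labels)  # ['NAME', 'DDMM2000', 'CITY']
```

The start and end offsets make it possible to skip over codes when searching a sentence for the targeted symbol, since the wrapper itself contains a '#'.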

The de-identified symbol sentence data set is pipe-delimited ('|'). All whitespace characters in sentences were replaced with spaces.
Column 1: The targeted symbol
Column 2: Sense, the meaning of the targeted symbol
Column 3: The position of the targeted symbol in the given sentence. Positions are zero-based: the sentence starts at position 0.
Column 4: De-identified sample including the targeted symbol
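
Example (not part of the original release): a minimal Python sketch for reading one row of the data set. It assumes the four columns appear in the order listed above and that Column 3 is a character offset into the sentence; if the position turns out to be counted differently (for example, by token), the check at the end will not hold. The names SymbolRecord and parse_line, the sense label, and the sample row are made up for illustration.

```python
from typing import NamedTuple


class SymbolRecord(NamedTuple):
    symbol: str    # Column 1: the targeted symbol
    sense: str     # Column 2: annotated sense
    position: int  # Column 3: zero-based position in the sentence
    sentence: str  # Column 4: de-identified sample


def parse_line(line: str) -> SymbolRecord:
    """Parse one pipe-delimited row into a SymbolRecord."""
    # Split on the first three pipes only, so that any '|' occurring inside
    # the sentence itself stays part of Column 4.
    symbol, sense, position, sentence = line.rstrip("\r\n").split("|", 3)
    return SymbolRecord(symbol, sense, int(position), sentence)


# Hypothetical row, not taken from the data set.
record = parse_line("+|positive|12|Culture was + for strep.")

# Assumption: the position is a character offset, so indexing the sentence
# at that offset should give back the targeted symbol.
print(record.sentence[record.position] == record.symbol)
```

Limiting the split to three pipes is a defensive choice; it makes no difference if the sentences never contain a '|'.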