Word sense disambiguation is the task of automatically identifying the appropriate sense (or concept) of an ambiguous word. For example, the term 'cold' could refer to the temperature or to a virus, depending on the context in which it is used. Failing to identify the intended concept of an ambiguous word reduces the accuracy of biomedical applications such as medical coding and indexing, which are becoming essential in the biomedical and clinical world with the push towards electronic medical records and the growing amount of information available to biomedical researchers and clinicians. This dissertation focuses on disambiguating ambiguous words in biomedical text.
This dissertation presents two methods, K-CUI and A-CUI, that can disambiguate ambiguous terms in any biomedical text using information from the Unified Medical Language System (UMLS). K-CUI explores the use of Concept Unique Identifiers (CUIs), as assigned by MetaMap, as features for a supervised learning method for word sense disambiguation. It also investigates four techniques that reduce noise in the feature set by restricting which CUIs to include. The first technique is windowing, whose results show that in biomedical text indicative CUIs are highly localized. The second is a frequency cutoff, whose results show that when a dataset contains a high majority concept, the features that occur only a few times are essential in disambiguating the minority concepts. The third is a MetaMap Indexing cutoff, whose results show that word concepts are correlated with the topical information describing an instance. The fourth is a semantic similarity cutoff, whose results show that, in biomedical text, indicative features have a high semantic similarity with at least one of the possible concepts of the ambiguous word.
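As a rough illustration of the first two feature-restriction techniques above, the following Python sketch applies a positional window and a frequency cutoff to lists of CUI features. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the function names, the window and cutoff parameters, and the placeholder identifiers in the usage note are all hypothetical, and MetaMap itself is not invoked.

```python
from collections import Counter


def window_features(cuis, target_index, window):
    """Keep the CUIs within `window` positions of the target word.

    `cuis` is the ordered list of identifiers for one instance and
    `target_index` is the position of the ambiguous term (hypothetical
    representation; the target itself is excluded from the features).
    """
    lo = max(0, target_index - window)
    hi = target_index + window + 1
    return [c for i, c in enumerate(cuis[lo:hi], start=lo) if i != target_index]


def frequency_cutoff(instances, min_count):
    """Drop CUIs that occur fewer than `min_count` times across all instances."""
    counts = Counter(c for feats in instances for c in feats)
    return [[c for c in feats if counts[c] >= min_count] for feats in instances]
```

For example, with the invented placeholder sequence `["X1", "X2", "TARGET", "X3", "X4"]`, calling `window_features(seq, 2, 1)` keeps only `X2` and `X3`. Raising `min_count` in `frequency_cutoff` removes rare features, which, per the results summarized above, can hurt disambiguation of minority concepts when one concept dominates the data.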
A-CUI is a knowledge-based method that uses information from the UMLS and MetaMap mapped text to represent the context of the possible concepts of an ambiguous word. It investigates three types of contextual representations. The first uses the concept's definition in the UMLS, whose results show that the words in the definition can be used to represent the context of the concept. The second uses the preferred and associated terms from the UMLS, whose results show that the terms themselves do not provide enough contextual information to disambiguate between the possible concepts of a target word. The third uses the words surrounding the concept in MetaMap mapped text, whose results show that the information provided by MetaMap is distinct enough to distinguish between the possible concepts for disambiguation purposes.
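The general idea of comparing an instance's context against a textual representation of each candidate concept can be sketched in Python as follows. This is only an illustrative sketch: the abstract does not say how contexts are compared, so the bag-of-words vectors and cosine similarity used here are assumptions, and the function names, concept labels, and example texts are hypothetical rather than taken from the UMLS.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased, whitespace-tokenized word counts (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context, concept_texts):
    """Return the concept whose representation is most similar to the context.

    `concept_texts` maps each candidate concept label to the text that
    represents it (a definition, associated terms, or surrounding words).
    """
    ctx = bag_of_words(context)
    return max(concept_texts, key=lambda c: cosine(ctx, bag_of_words(concept_texts[c])))
```

Swapping what goes into `concept_texts` (definitions, preferred and associated terms, or words surrounding the concept in mapped text) corresponds loosely to the three contextual representations described above.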
K-CUI and A-CUI are evaluated using the NLM-WSD dataset, which consists of Medline abstracts. Previous work in this area has also been evaluated on the same dataset, and in some cases the methods were tailored to work only on Medline abstracts. Identifying the correct concept of an ambiguous term in Medline abstracts is a significant problem, but the advantage of K-CUI and A-CUI is that they are portable systems that can disambiguate terms in any biomedical text, unlike previous methods that are limited to Medline abstracts.
There has also been previous work that determines the correct concept of a target word by first identifying the target word's semantic type, which is a broad categorization of a concept. After the semantic type of the ambiguous word is identified, the correct concept is selected based on that semantic type. The assumption is that each possible concept of a target word has a unique semantic type. If the possible concepts have the same semantic type, this method cannot distinguish between them; A-CUI and K-CUI do not have this limitation. Also, identifying the semantic type of a target word is a simpler problem than identifying the concept, because semantic types are a coarser-grained categorization than CUIs, which makes them easier to assign.
University of Minnesota Ph.D. dissertation. September 2009. Major: Computer Science. Advisors: John Vincent Carlis, Ted Pedersen. 1 computer file (PDF); x, 234 pages, appendices A-C.
McInnes, Bridget Thomson.
Supervised and knowledge-based methods for disambiguating terms in biomedical text using the UMLS and MetaMap.
Retrieved from the University of Minnesota Digital Conservancy,