Browsing by Subject "Natural Language Processing"
Item Choosing a “Source of Truth”: The Implications of using Self versus Interviewer Ratings of Interviewee Personality as Training Data for Language-Based Personality Assessments (2022-12), Auer, Elena

Advancement in research and practice in the application of machine learning (ML) and natural language processing (NLP) in psychological measurement has primarily focused on the implementation of new NLP techniques, new data sources (e.g., social media), or cutting-edge ML models. However, research attention, particularly in psychology, has given little focus to the importance of criterion choice when training ML and NLP models. Core to almost all models designed to predict psychological constructs or attributes is the choice of a “source of truth.” Models are typically optimally trained to predict something, meaning the choice of scores the models are attempting to predict (e.g., self-reported personality) is critical to understanding the constructs reflected by the ML- or NLP-based measures. The goal of this study was to begin to understand the nuances of selecting a “source of truth” by identifying and exploring the impact of the methodological effects attributable to choosing a “source of truth” when generating language-based personality scores. Four primary findings emerged. First, in the context of scoring interview transcripts, there was a clear performance difference between language-based models predicting self-reported scores and those predicting interviewer ratings, such that language-based models could predict interviewer ratings much better than self-reported ratings of conscientiousness. Second, this study provides some of the first explicit empirical evidence of the method effects that can occur in the context of language-based scores. Third, there are clear differences between the psychometric properties of language-based self-report and language-based interviewer rating scores, and these patterns seemed to be the result of a proxy effect, whereby the psychometric properties of the language-based ratings mimicked the psychometric properties of the human ratings they were derived from. Fourth, while there was evidence of a proxy effect, language-based scores had slightly different psychometric properties compared to the scores they were trained on, suggesting that it would not be appropriate to fully assume the psychometric properties of language-based assessments based on the ratings the models were trained on. Ultimately, this study is one of the first attempts toward better isolating and understanding the modular effects of language-based assessment methods, and future research should continue applying psychometric theory and research to advances in language-based psychological assessment tools.

Item Detecting Cognitive Impairment from Language and Speech for Early Screening of Alzheimer's Disease Dementia with Interpretable Transformer-Based Language Models (2024-05), Li, Changye

Alzheimer’s disease (AD) is a neurodegenerative disorder that affects the use of speech and language and is difficult to diagnose in its early stages. Neural language models (NLMs) have delivered impressive performance on the task of discriminating between language produced by cognitively healthy individuals and language produced by those with AD. As artificial neural networks (ANNs) grow in complexity, understanding their inner workings becomes increasingly challenging, which is particularly important in healthcare applications.
Intrinsic evaluation metrics of autoregressive NLMs (e.g., predicting the next token given the context), such as perplexity (PPL), which reflects a model’s “surprise” at novel input, have been widely used to understand the behavior of NLMs. As an alternative to fitting model parameters directly, this thesis proposes a novel method by which a pre-trained transformer-based NLM, GPT-2, is paired with an artificially degraded version of itself, GPT-D, to compute the ratio between these two models’ PPLs on language from cognitively healthy and impaired individuals. This technique approaches state-of-the-art (SOTA) performance on text data from a widely used “Cookie Theft” picture description task and, unlike established alternatives, also generalizes well to spontaneous conversations. In addition, the degraded models generate text with characteristics known to be associated with AD, demonstrating the induction of dementia-related linguistic anomalies. The novel attention head ablation method employed in this thesis exhibits properties attributed to the concepts of cognitive and brain reserve in human brain studies, which postulate that people with more neurons in the brain and more efficient processing are more resilient to neurodegeneration. The results show that larger GPT-2 models require a disproportionately larger share of attention heads to be masked/ablated to display degradation of similar magnitude to masking in smaller models. To realize their benefits for assessment of mental status, transformer-based NLMs require verbatim transcriptions of speech from patients. While such models have shown promise in detecting cognitive impairment from language samples, the feasibility of deploying such automated tools in large-scale clinical settings depends on the ability to reliably capture and transcribe the speech input. Currently available automatic speech recognition (ASR) solutions have improved dramatically over the last few years but are still not perfect and can have high error rates on challenging speech, such as speech from audio data with sub-optimal recording quality. One of the key questions for successfully applying ASR technology in clinical applications is whether imperfect transcripts generated by ASR provide sufficient information for downstream tasks to operate at an acceptable level of accuracy. This thesis examines the relationship between the errors produced by several transformer-based ASR systems and their impact on downstream dementia classification. One of the key findings is that ASR errors may provide important features for this downstream classification task, resulting in better performance compared to using manual transcripts. In summary, this thesis is a step toward a better understanding of the relationships between the inner workings of generative NLMs, the language that they produce, and the deleterious effects of dementia on human speech and language characteristics. The probing methods also suggest that the attention mechanism in transformer models may present an analogue to the notions of cognitive and brain reserve and could potentially be used to model certain aspects of the progression of neurodegenerative disorders and aging. Additionally, the results presented in this thesis suggest that the ASR models and the downstream classification models react to acoustic and linguistic dementia manifestations in systematic and mutually synergistic ways, which would have significant implications for the use of ASR technology. This line of research enables automated analysis of speech collected from patients, at least in dementia screening settings, and has the potential to expand to a variety of other clinical applications in which both language and speech characteristics are affected.
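The paired-model perplexity ratio described in this abstract can be illustrated with a minimal sketch, assuming the Hugging Face transformers library and the public gpt2 checkpoint; the specific attention heads ablated to form the degraded copy, the direction of the ratio, and the example transcript are illustrative assumptions rather than the thesis's configuration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# "GPT-D": a copy of GPT-2 artificially degraded by ablating (pruning) a subset
# of attention heads. The layers/heads chosen here are arbitrary illustrations.
gpt_d = GPT2LMHeadModel.from_pretrained("gpt2").eval()
gpt_d.prune_heads({0: list(range(6)), 1: list(range(6))})

def perplexity(model, text):
    """Exponentiated average token-level cross-entropy of `text` under `model`."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

transcript = "The boy is on the stool reaching for the cookie jar."
ratio = perplexity(gpt_d, transcript) / perplexity(gpt2, transcript)
print(f"PPL ratio (degraded / intact): {ratio:.3f}")
```

In the thesis this kind of score is computed over transcripts from cognitively healthy and impaired speakers and used as a classification signal; the sketch only shows how a paired-model ratio could be computed for a single transcript.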
Item Hate Speech Detection In Twitter: A Selectively Trained Ensemble Method (2020-05), Houston, Jackson

This thesis tests classification models from natural language processing and machine learning on the task of identifying hate speech. We tested on multiple annotated data sets (Davidson et al. 2017) of tweet data labeled as hate speech, offensive speech, both, or neither. Hate speech has become an unavoidable topic in the current social media environment due to poorly monitored comment sections and news feeds. With that, studies showing the negative effects it brings to people’s well-being have also begun to surface (Gelber and McNamara 2015). Therefore, being able to identify hate speech accurately and precisely has grown in importance. Hate speech is often contextual, subjective, and a matter of opinion, which makes creating an accurate model of such speech all the more difficult. We found that an ensemble of a classic Naive Bayes classifier (Pedregosa et al. 2019c), Random Forest (Pedregosa et al. 2019b), K-Means (Pedregosa et al. 2019d), and Bernoulli (Pedregosa et al. 2019a) performed better than similar studies in precision, accuracy, recall, and f-score (Malmasi and Zampieri 2018). The ensemble performed better than the strongest of the individual models, Random Forest, by a small but useful margin. We believe this is because the nuanced nature and context behind hate speech are more than one model can fully encompass. In addition to the ensemble strategy, training on data labeled as ‘clean’ (not hate speech or offensive) or ‘dirty’ (hate speech) with higher confidence ratings increased the precision of our model by around 10% in some cases, compared to training on the complete data set, including tweets with a blurred sentiment, such as tweets that are offensive but not hate speech. Having an accurate and precise model such as this will allow organizations to protect their users from such language and prevent the negative effects of hate speech. Additionally, it will allow us to identify more hate speech tweets or statements, providing more data for future research and for finding trends deeper than the tweet text alone, such as replies, retweets, and user biographies.
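A minimal sketch of the ensemble strategy described in this abstract, assuming scikit-learn; the TF-IDF features, hyperparameters, and toy data are illustrative assumptions, and K-Means is omitted from the sketch because it is an unsupervised estimator in scikit-learn and cannot vote directly in this setup.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data; in practice this would be the annotated tweet corpus
# (e.g., Davidson et al. 2017) filtered to high-confidence "clean"/"dirty" labels.
texts = ["you are wonderful", "I hate you and your kind",
         "lovely day", "get out of my country"]
labels = ["clean", "dirty", "clean", "dirty"]

# Hard-voting ensemble of the supervised classifiers named in the abstract.
ensemble = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("bernoulli", BernoulliNB()),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ],
        voting="hard",
    ),
)

ensemble.fit(texts, labels)
print(ensemble.predict(["have a great weekend", "I hate people like you"]))
```

Hard voting takes the majority label across the component classifiers, which mirrors the idea that no single model fully captures the nuance and context of hate speech.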
Item Improvements to a Speech Repair Parser (2016-04), Exley, Andrew

Parsing is a common task for speech-recognition systems, but many parsers ignore the possibility of speech errors and repairs, which are very common in conversational language. The goal of this thesis is to examine a parsing system that can handle these occurrences and to improve its performance by incorporating systems that use linguistic knowledge about speech errors and repairs. The basis for this thesis is a system for incremental parsing. The thesis shows additions that can be made to that system to allow for detection of speech errors and repairs, which is shown to be an improvement on previous incremental systems. An extension to the system is introduced which incorporates ideas about human short-term memory and its relationship to speech errors. The system is then tested with many different configurations. Finally, the thesis concludes with a summary and discussion of the various results and lays out possible avenues for future work.

Item Improving Search via Named Entity Recognition in Morphologically Rich Languages – A Case Study in Urdu (2018-02), Riaz, Kashif

Search is not a solved problem, even in the world of Google's and Bing's state-of-the-art engines. Google and similar search engines are keyword based. Keyword-based searching suffers from the vocabulary mismatch problem: the terms in the document and the user's information request do not overlap (for example, "cars" and "automobiles"); this phenomenon is called synonymy. Similarly, the user's term may be polysemous: a user inquiring about a river's bank may be matched with documents about financial institutions. Vocabulary mismatch is exacerbated when the search occurs in a morphologically rich language (MRL). Concept search techniques like dimensionality reduction do not improve search in morphologically rich languages. Names frequently occur in news text and determine the "what," "where," "when," and "who" in the news text. Named entity recognition (NER) attempts to recognize names automatically in text, but these techniques are far from mature in MRLs, especially in Arabic-script languages. Urdu is one of the focus MRLs of this dissertation, alongside Arabic, Farsi, Hindi, and Russian, but it does not have the enabling technologies for NER and search. A corpus, a stop-word generation algorithm, a light stemmer, a baseline, and an NER algorithm are created so that NER-aware search can be accomplished for Urdu. This dissertation demonstrates that NER-aware search on Arabic, Russian, Urdu, and English shows significant improvement over the baseline. Furthermore, this dissertation highlights the challenges of research in low-resource MRLs.

Item Language Evolves, so should WordNet - Automatically Extending WordNet with the Senses of Out of Vocabulary Lemmas (2017-05), Rusert, Jonathan

This thesis provides a solution that finds the optimal location to insert the sense of a word not currently found in the lexical database WordNet. Currently WordNet contains common words that are already well established in the English language. However, there are many technical terms and examples of jargon that suddenly become popular, and new slang expressions and idioms that arise. WordNet will only stay viable to the degree to which it can incorporate such terminology in an automatic and reliable fashion. To solve this problem, we have developed an approach that measures the relatedness of the definition of a novel sense against the definitions of all senses with the same part of speech in WordNet. These measurements were done using a variety of measures, including Extended Gloss Overlaps, Gloss Vectors, and Word2Vec. After identifying the definition most related to the novel sense, we determine whether this sense should be merged as a synonym or attached as a hyponym to an existing sense. Our method participated in a shared task on Semantic Taxonomy Enhancement conducted as a part of SemEval-2016; it fared much better than a random baseline and was comparable to various other participating systems. This approach is not only effective but also represents a departure from existing techniques, thereby expanding the range of possible solutions to this problem.
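A minimal sketch of the definition-relatedness idea described in this abstract, assuming NLTK's WordNet interface; the simple token-overlap score is a crude stand-in for the Extended Gloss Overlaps, Gloss Vectors, and Word2Vec measures used in the thesis, and the novel definition is a hypothetical example.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def gloss_overlap(definition_a, definition_b):
    """Crude relatedness score: number of shared lowercase tokens between two glosses."""
    return len(set(definition_a.lower().split()) & set(definition_b.lower().split()))

def best_attachment_point(novel_definition, pos=wn.NOUN):
    """Find the existing synset whose gloss is most related to a novel sense's definition."""
    best_synset, best_score = None, -1
    for synset in wn.all_synsets(pos):
        score = gloss_overlap(novel_definition, synset.definition())
        if score > best_score:
            best_synset, best_score = synset, score
    return best_synset, best_score

# Hypothetical out-of-vocabulary sense and its definition.
novel_def = "a small handheld computing device used to browse the web and run apps"
synset, score = best_attachment_point(novel_def)
print(synset.name(), score)
# A fuller system would then decide whether to merge the novel sense as a
# synonym of this synset or attach it underneath as a hyponym.
```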
Item Natural Language Processing Methods to Automatically Parse Eligibility Criteria in Dietary Supplements Clinical Trials (2020-08), Bompelli, Anusha

Dietary supplements (DSs) have been widely used in the U.S. and evaluated in clinical trials as potential interventions for various diseases. However, many clinical trials face challenges in recruiting enough eligible patients in a timely fashion, causing delays or even early termination. Using electronic health records to find eligible patients who meet clinical trial eligibility criteria has been shown to be a promising way to assess recruitment feasibility and accelerate the recruitment process. Natural language processing (NLP) techniques have been used extensively to extract concepts from clinical trial eligibility criteria. However, a significant obstacle is identifying an efficient named entity recognition (NER) system to parse the clinical trial eligibility criteria. The study comprises two parts. In the first part, the objectives were to (1) understand data elements associated with DS trials' eligibility criteria and assess whether they can be mapped to the OMOP Common Data Model (CDM), and (2) develop and evaluate NLP methods, especially deep learning-based models, for extracting eligibility criteria data elements. We analyzed the eligibility criteria of 100 randomly selected DS clinical trials and identified both computable and non-computable criteria. We mapped annotated entities to the OMOP CDM, adding novel entities (e.g., DS). We also evaluated a deep learning model (Bi-LSTM-CRF) for extracting these entities on the CLAMP platform, with an average F1 measure of 0.601. This study shows the feasibility of automatically parsing eligibility criteria following the OMOP CDM for future cohort identification. In the second part, the objective was to examine the performance of standard open-source clinical NLP systems on the task of NER for a corpus outside the domain for which these systems were developed. We used NLP-ADAPT (Artifact Discovery and Preparation Toolkit) to compare existing biomedical NLP systems (BioMedICUS, CLAMP, cTAKES, and MetaMap) and their Boolean ensemble in identifying entities in the eligibility criteria of 150 randomly selected DS clinical trials. We created a custom mapping of the gold-standard annotated entities to UMLS semantic types to align with the annotations from each system. All systems in NLP-ADAPT used their default pipelines to extract entities based on our custom mappings. The systems performed reasonably well in extracting UMLS concepts belonging to the semantic types Disorders and Chemicals and Drugs. Among all systems, cTAKES was the highest-performing system for the Chemicals and Drugs and Disorders semantic groups, and BioMedICUS was the highest-performing system for Procedures, Living Beings, Concepts and Ideas, and Devices. The Boolean ensemble, however, outperformed the individual systems. This study sets a baseline that can potentially be improved with modifications to the NLP-ADAPT pipeline.
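A minimal sketch of a Boolean (union-style) ensemble over entity annotations from multiple NER systems, in the spirit of the ensemble described in this abstract; the span representation, the example outputs, and the choice of a simple union are illustrative assumptions, since the NLP-ADAPT pipeline and its exact merging rules are not reproduced here.

```python
from typing import NamedTuple, Set

class Entity(NamedTuple):
    start: int           # character offset where the mention begins
    end: int              # character offset where the mention ends (exclusive)
    semantic_group: str  # e.g., "Chemicals and Drugs", "Disorders"

def boolean_union_ensemble(*system_outputs: Set[Entity]) -> Set[Entity]:
    """Keep every entity proposed by at least one system (Boolean OR over systems)."""
    merged: Set[Entity] = set()
    for output in system_outputs:
        merged |= output
    return merged

# Hypothetical outputs for the criterion "History of diabetes; taking vitamin D supplements."
ctakes_out = {Entity(11, 19, "Disorders")}
biomedicus_out = {Entity(11, 19, "Disorders"), Entity(28, 37, "Chemicals and Drugs")}
metamap_out = {Entity(28, 37, "Chemicals and Drugs")}

print(boolean_union_ensemble(ctakes_out, biomedicus_out, metamap_out))
```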
Item Organizational Environmental Sustainability in Asian Countries: Assessment and Nomological Network (2022-06), Wang, Yilei

This dissertation recognizes that organizations and their employees are vital for addressing climate change and promoting environmental sustainability. It presents a comprehensive examination of the environmental sustainability performance of far eastern and southeastern Asian organizations, focusing on actions (i.e., what organizations do). Five studies are presented that adopt an initiative-based approach to assess organizational environmental performance, develop NLP methods to content-analyze environmental sustainability initiatives, and validate the assessment by examining its nomological network. The first study utilized natural language processing (NLP) and machine learning (ML) to facilitate the human coding process in the content analysis of environmental sustainability initiatives. Study two implemented a human-machine hybrid approach to quantify organizations' environmental performance by content-analyzing the initiatives reported in their 2018 corporate social responsibility (CSR) reports. The environmental performance of 859 Asian organizations was compared across eleven major Asian countries/regions (mainland China, Hong Kong, Indonesia, Japan, Malaysia, the Philippines, Singapore, South Korea, Taiwan, Thailand, and Vietnam). An exploratory factor analysis was conducted to examine the factor structure of company environmental performance in this region, which suggested a bi-factor structure with a general factor and three group factors. Study three examined and demonstrated the convergent and discriminant validities of companies' initiative-based environmental performance with third-party Environmental, Social, and Governance (ESG) ratings. Study four investigated the relationship between top management (CEO and board of directors) characteristics and organizations' environmental performance. Finally, study five analyzed the relationships between organizational environmental performance and financial performance. Together, these studies constitute the first in-depth, psychologically informed, and quantitatively sophisticated investigations of pro-environmental organizational actions in Asia. The findings stand to inform the literature on organizational environmental sustainability in general, and in Asian organizations in particular.
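A minimal sketch of using a text classifier to assist human coding of sustainability initiatives, in the spirit of the NLP/ML-assisted content analysis described in this abstract; the category labels, toy sentences, features, and model choice are illustrative assumptions rather than the dissertation's actual coding scheme.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy initiative sentences with hypothetical human-coded categories.
initiatives = [
    "Installed rooftop solar panels at three manufacturing plants.",
    "Reduced packaging waste by switching to recycled materials.",
    "Launched an employee carpooling and public transit subsidy program.",
    "Cut water consumption in production by reusing process water.",
]
categories = ["energy", "waste", "transport", "water"]

# TF-IDF features plus a simple linear classifier; in a human-machine hybrid
# workflow, coders would review the model's suggested labels rather than
# accept them automatically.
coder_assist = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
coder_assist.fit(initiatives, categories)

print(coder_assist.predict(["Switched warehouse lighting to LED to lower electricity use."]))
```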
Item TRUST: Clinical Text Retrieval and Use towards Scientific Rigor and Transparent Process (2021-12), Fu, Sunyang

Rapid proliferation and adoption of the electronic health record (EHR) have led to seamless integration of clinical research into practice and have facilitated healthcare decision-making by enabling an accurate and timely supply of health information. Leveraging this supply of information, the Institute of Medicine envisioned the concept of continuously Learning Health Systems (LHS) in 2007, with the aim of first deriving knowledge from routine care data and then translating such knowledge into evidence-based clinical practice. To achieve such a vision, it is critical to have a robust data and informatics infrastructure with the following properties: 1) high-throughput and real-time methods for data retrieval, extraction, and analysis; 2) transparent and reproducible processes to ensure scientific rigor in clinical research; and 3) implementable and generalizable scientific findings. There are many approaches to the derivation of knowledge from care data, one of which is chart review: a common, albeit manual, approach to practice-based knowledge discovery. Traditionally, chart review is performed by manually reviewing patient medical records. As a significant portion of clinical information is represented in textual format, this manual approach can be time-consuming and costly. With the implementation of EHRs, chart review can be automated by systematically extracting data from structured fields and leveraging natural language processing (NLP) techniques to extract information from text. Rigorous development and evaluation of NLP algorithms for a specific chart review task require, however, data abstraction and annotation (i.e., the manual creation of a gold-standard clinical corpus to evaluate the developed NLP algorithm). In EHR-based settings, there is a lack of standard processes or best practices for creating such a corpus, owing to the heterogeneity of institutional EHR systems and process variation between single-site and multi-site research settings. Recent advances in healthcare AI identify the need for detailed provenance of the data used in the training and validation of AI models. Secondary use of EHRs for clinical research leveraging AI technologies such as NLP therefore requires documentation of the provenance information relating to the process used for retrieving and organizing the raw data as well as for extracting and annotating training data. We thus define this process as the clinical Text Retrieval and Use towards Scientific rigor and Transparent (TRUST) process. As EHR-based research becomes increasingly integrated into clinical care, it is important to have a systematic understanding of the TRUST process, its utilization when developing informatics tools and methods, and its overall impact on research reproducibility. In this work, we propose a multi-phase method to develop informatics frameworks and best practices that ensure reproducible TRUST processes for single-site and multi-site studies. In the following chapters, we propose: 1) a definition of reproducibility in the context of the secondary use of EHRs; 2) methods to assess various levels of data heterogeneity caused by differing EHR systems and inter-institutional variations; 3) approaches to examine the implications of data heterogeneity for reproducibility; 4) steps to develop frameworks, best practices, and reporting standards conforming to the TRUST process; and 5) an application of the TRUST process in a real-world case study.