Browsing by Subject "Data Science"

Choosing a “Source of Truth”: The Implications of using Self versus Interviewer Ratings of Interviewee Personality as Training Data for Language-Based Personality Assessments
(2022-12) Auer, Elena
Advancement in research and practice in the application of machine learning (ML) and natural language processing (NLP) in psychological measurement has primarily focused on the implementation of new NLP techniques, new data sources (e.g., social media), or cutting-edge ML models. However, research attention, particularly in psychology, has lacked a major focus on the importance of criterion choice when training ML and NLP models. Core to almost all models designed to predict psychological constructs or attributes is the choice of a “source of truth.” Models are typically optimally trained to predict something, meaning the choice of scores the models are attempting to predict (e.g., self-reported personality) is critical to understanding the constructs reflected by the ML or NLP-based measures. The goal of this study was to begin to understand the nuances of selecting a “source of truth” by identifying and exploring the impact of the methodological effects attributable to choosing a “source of truth” when generating language-based personality scores. There were four primary findings that emerged. First, in the context of scoring interview transcripts, there was a clear performance difference between language-based models predicting self-reported scores and interviewer ratings such that language-based models could predict interviewer ratings much better than self-reported ratings of conscientiousness. Second, this is some of the first explicit empirical evidence of the method effects that can occur in the context of language-based scores. Third, there are clear differences between the psychometric properties of language-based self-report and language-based interviewer rating scores and these patterns seemed to be the result of a proxy effect, where the psychometric properties of the language-based ratings mimicked the psychometric properties of the human ratings they were derived from. 
Fourth, while there was evidence of a proxy effect, language-based scores had slightly different psychometric properties compared to the scores they were trained on, suggesting that it would not be appropriate to fully assume the psychometric properties of language-based assessments based on the ratings the models were trained on. Ultimately, this study is one of the first attempts towards better isolating and understanding the modular effects of language-based assessment methods and future research should continue the application of psychometric theory and research to advances in language-based psychological assessment tools.
Finding Integrative Biomarkers from Biomedical Datasets: An application to Clinical and Genomic Data
(2015-08) Dey, Sanjoy
Human diseases, such as cancer, diabetes and schizophrenia, are inherently complex and governed by the interplay of various underlying factors ranging from genetic and genomic influences to environmental effects. Recent advancements in high throughput data collection technologies in bioinformatics have resulted in a dramatic increase in diverse data sets that can provide information about such factors related to diseases. These types of data include DNA microarrays providing cellular information, Single Nucleotide Polymorphisms (SNPs) providing genetic information, metabolomics data in terms of proteins and other metabolites, structural and functional brain data from magnetic resonance imaging (MRI), and electronic health records (EHRs) containing copious information about histo-pathological factors, demographic, and environmental effects. Despite their richness, each of these datasets only provides information about a part of the complex biological mechanism behind human diseases. Thus, effective integration of the partial information of any of these genomic and clinical data can help reveal disease complexities in greater detail by generating new data-driven hypotheses beyond the traditional hypotheses about biomarkers. In particular, integrative biomarkers, i.e., patterns of features that are predictive of disease and that go beyond the simple biomarkers derived from a single dataset, can lead to a customized and more effective approach to improving healthcare. This thesis focuses on addressing the key issues related to integrative biomarkers by developing new data mining approaches. One very important issue of biomarker discovery is that the models have to be easily interpretable, i.e., integrative models have to be not only predictive of the disease, but also interpretable enough so that domain experts can infer useful knowledge from the obtained patterns.
In one such effort to make models interpretable, domain information about disease relationships was used as prior knowledge during model development. In addition, a novel metric called I-score was proposed using medical literature to quantify the interpretability of the obtained patterns. Another key issue of integrative biomarker discovery is that there may be many potential relationships present among diverse datasets. For example, a very important type of relationship in biomarker discovery is interaction, i.e., biomarkers spanning multiple datasets whose combined features are more indicative of disease than the individual constituent factors. In particular, the individual effects of each type of factor on disease predisposition can be small and thus remain undetected by most disease association techniques performed on individual datasets. Different types of relationships are explored and an association-analysis-based framework is proposed to discover them. The proposed framework is especially effective for discovering higher-order relationships, which cannot be found by the existing prominent integrative approaches for biomarker discovery. When applied to real datasets collected from three different types of data from schizophrenic and normal subjects, this approach yielded significant integrated biomarkers which are biologically relevant. Disease heterogeneity creates further issues for integrative biomarker discovery: biomarkers obtained from clinicogenomic studies may not be applicable to all patients to the same degree, i.e., a disease may consist of multiple subtypes, each occurring in different subpopulations. Some potential reasons for disease heterogeneity are different pathways playing different roles in the same disease and confounding factors such as age, ethnicity and race, or genetic predisposition, which can be available in rich EHR data.
Most biomarker discovery techniques use full space model development techniques, i.e., they assess the performance of biomarkers on all patients without finding the distinct subpopulations. In this thesis, more customized models were built depending on patients' characteristics to handle disease heterogeneity. In summary, several data mining techniques developed in this thesis advance the state of the art in integration of diverse biomedical datasets. Moreover, their applications to large-scale EHR data yield significant discoveries, which can ultimately lead to generating new data-driven hypotheses for inferring meaningful information about complex disease mechanisms.
Oral history interview with Daniel (Dan) Boley
(Charles Babbage Institute, 2024-01-30) Boley, Daniel
This interview was conducted by CBI for CS&E, a multi-year project extending from the 50th Anniversary of the University of Minnesota Computer Science Department (now Computer Science and Engineering, CS&E). The oral history begins with Boley’s early interests, undergraduate work at Cornell, and completing a doctorate at Stanford University. It explores the Computer Science Department environment in the 1980s, its administration, Boley’s teaching, and research in various areas of numerical analysis, data science, and machine learning. This includes his work, often allowing graduate students to follow their interests, in applications such as health/medicine, navigation, etc. He discusses this work with Vipin Kumar, collaborations across departments in the College of Science and Engineering and with other colleges such as the College of Liberal Arts, and the discussions, debates, and launch of the immediately popular and fast-growing Data Science Program.
Using Social Media Data for the Common Good
(2019-09-12) King, Gary; Jacobs, Lawrence R.; McGeveran, William

University Digital Conservancy