Browsing by Subject "Statistics"
Now showing 1 - 20 of 58
Item Adaptive model selection in linear mixed models. (2009-08) Zhang, Bo
Linear mixed models are commonly used models in the analysis of correlated data, in which the observed data are grouped according to one or more clustering factors. The selection of covariates, the variance structure, and the correlation structure is crucial to the accuracy of both estimation and prediction in linear mixed models. Information criteria such as Akaike's information criterion, Bayesian information criterion, and the risk inflation criterion are commonly applied to select linear mixed models. Most information criteria penalize an increase in the size of a model through a fixed penalization parameter. In this dissertation, we first derive the generalized degrees of freedom for linear mixed models. A resampling technique, data perturbation, is employed to estimate the generalized degrees of freedom of linear mixed models. Further, based upon the generalized degrees of freedom of linear mixed models, we develop an adaptive model selection procedure with a data-adaptive model complexity penalty for selecting linear mixed models. The asymptotic optimality of the adaptive model selection procedure in linear mixed models is shown over a class of information criteria. The performance of the adaptive model selection procedure in linear mixed models is studied by numerical simulations. Simulation results show that the adaptive model selection procedure outperforms information criteria such as Akaike's information criterion and Bayesian information criterion in selecting covariates, the variance structure, and the correlation structure in linear mixed models. Finally, an application to diabetic retinopathy is examined.
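
The data-perturbation estimate of generalized degrees of freedom mentioned in the abstract above can be illustrated with a small Monte Carlo sketch. The version below is only a toy: it uses an ordinary least squares fit as a stand-in for the linear mixed model fit, and the data, perturbation scale, and function names are illustrative assumptions rather than anything taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n observations; an OLS fit stands in for the mixed-model fit.
n, p = 100, 5
X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)

def fitted_values(X, y):
    """Stand-in fitting procedure; a linear mixed model fit would go here."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

def gdf_by_perturbation(X, y, tau=0.5, T=200, rng=rng):
    """Monte Carlo estimate of generalized degrees of freedom:
    GDF ~= sum_i cov(mu_hat_i(y + delta), delta_i) / tau^2."""
    deltas = rng.normal(scale=tau, size=(T, len(y)))
    fits = np.array([fitted_values(X, y + d) for d in deltas])
    # Covariance between each fitted value and its own perturbation.
    cov_i = ((fits - fits.mean(0)) * (deltas - deltas.mean(0))).mean(0)
    return cov_i.sum() / tau**2

print("estimated GDF:", gdf_by_perturbation(X, y))
```

For OLS the estimate should land near the number of regression coefficients (here 5), which is the sanity check that makes the toy useful before swapping in a more complex fitting procedure.
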
Item Bayesian spatiotemporal modeling using spatial hierarchical priors with applications to functional magnetic resonance imaging (2015-01) Bezener, Martin Andrew
Functional magnetic resonance imaging (fMRI) has recently become a popular tool for studying human brain activity. Despite its widespread use, most existing statistical methods for analyzing fMRI data are problematic. Many methodologies oversimplify the problem for the sake of computational efficiency, often not providing a full statistical model as a result. Other methods are too computationally inefficient to use on large data sets. In this paper, we propose a Bayesian method for analyzing fMRI data that is computationally efficient and provides a full statistical model.

Item Computational issues in using Bayesian hierarchical methods for the spatial modeling of fMRI data. (2010-08) Lee, Kuo-Jung
One of the major objectives of fMRI (functional magnetic resonance imaging) studies is to determine which areas of the brain are activated in response to a stimulus or task. To make inferences about task-specific changes in underlying neuronal activity, various statistical models are used such as general linear models (GLMs). Frequentist methods assessing human brain activity using data from fMRI experiments rely on results from the theory of Gaussian random fields. Such methods have several limitations. The Bayesian paradigm provides an attractive framework for making inference using complex models and bypassing the multiple comparison problems. We propose a Bayesian model which not only takes into account the complex spatio-temporal relationships in the data while still being computationally feasible, but gives a framework for addressing other interesting questions related to how the human brain works. We study the properties of this approach and demonstrate its performance on simulated and real examples.
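
Both fMRI abstracts above position their Bayesian models against the classical voxelwise general linear model. For context, here is a minimal single-voxel GLM fit on assumed toy data (a block-design regressor plus a linear drift); it shows only the frequentist baseline, not the spatiotemporal Bayesian models the theses propose.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy single-voxel time series: on/off task blocks plus drift and noise.
T = 200
task = np.tile(np.r_[np.zeros(10), np.ones(10)], T // 20)
drift = np.linspace(0, 1, T)
y = 0.8 * task + 0.5 * drift + rng.normal(scale=1.0, size=T)

# Design matrix: intercept, task regressor, linear drift.
X = np.column_stack([np.ones(T), task, drift])
beta, res_ss, *_ = np.linalg.lstsq(X, y, rcond=None)
df = T - X.shape[1]
sigma2 = res_ss[0] / df
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_stat = beta[1] / se
p_val = 2 * stats.t.sf(abs(t_stat), df)
print(f"task effect: beta={beta[1]:.3f}, t={t_stat:.2f}, p={p_val:.3g}")
```
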
Item Computational Modeling For The Vertical Bridgman Growth Of BaBrCl:Eu Crystal (2020-03) Zhang, Chang
In recent years, many new scintillator crystals for X-ray or gamma-ray detection have been discovered. They have great potential to be used in security devices or medical imaging devices. However, a couple of challenges need to be overcome before these scintillator crystals can be commercialized. First, the internal physical processes during the growth of these crystals are hard to observe, making it difficult to control and optimize the processes. Second, cracking is a major issue that hinders the growth of high-quality, large-size scintillator crystals. Slow cooling, a conventional way to reduce thermo-elastic stress, fails to completely prevent cracking in scintillator crystals. In this thesis, we, together with our experimental collaborators, demonstrate that computational modeling and advanced experimental tools can help researchers overcome these challenges and manufacture high-quality, large-size scintillator crystals. BaBrCl:Eu crystal and the vertical Bridgman method are chosen as the candidate material and candidate crystal growth method in this thesis. Tremsin and coworkers developed a neutron imaging system to observe the vertical Bridgman growth process of scintillator crystals. Their measurements provided a direct observation of segregation and interface shape within a vertical gradient freeze (VGF) system that is large enough to exhibit the complex interplay of heat transfer, fluid flow, segregation, and phase change characteristic of an industrially relevant melt-growth process. We have applied continuum models to simulate a VGF growth process of BaBrCl:Eu crystal conducted in the neutron imaging system. Our models provide a rigorous framework in which to understand the mechanisms that are responsible for the complicated evolution of interface shape and dopant distribution in the growth experiment. We explain how a transition in the solid/liquid interface shape from concave to convex is driven by changes in radial heat transfer caused by furnace design. We also provide a mechanistic explanation of how dynamic growth conditions and changes of the flow structure in the melt result in complicated segregation patterns in this system. Onken and coworkers used neutron diffraction to measure the crystal structure evolution of BaBrCl:Eu at different temperature levels. Their results showed that the chemical stress induced by the lattice mismatch between the Eu dopant and BaBrCl is responsible for the cracking of BaBrCl:Eu crystal during the cooling process. We developed finite element models to analyze the chemical stress in BaBrCl:Eu crystal under different growth conditions based on the study of Onken and coworkers. To our knowledge, these are the first computations of chemical stress in a bulk crystal growth process. Our results showed that the melt/crystal interface shape and the associated melt flows have a strong influence on the radial segregation outcome of Eu, which determines the chemical stress profile in the crystal. Counterintuitively, growing this crystal at slow growth rates can lead to high stress levels and tensile stress states near the cylindrical surface that promote cracking. However, a slightly faster growth rate can produce Eu radial concentration gradients that provide a protective, compressive layer that would suppress cracking. Our results show that the chemical stress could be tailored by designing appropriate interface shapes and melt flows.

Item Data and code supporting: Simulations corroborate telegraph model predictions for the extension distributions of nanochannel confined DNA (2019-08-12) Bhandari, Aditya Bikram; Dorfman, Kevin D (dorfman@umn.edu)
Hairpins in the conformation of DNA confined in nanochannels close to their persistence length cause the distribution of their fractional extensions to be heavily left skewed. A recent theory rationalizes these skewed distributions using a correlated telegraph process, which can be solved exactly in the asymptotic limit of small but frequent hairpin formation. Pruned-enriched Rosenbluth method simulations of the fractional extension distribution for a channel-confined wormlike chain confirm the predictions of the telegraph model. Remarkably, the asymptotic result of the telegraph model remains robust well outside the asymptotic limit. As a result, the approximations in the theory required to map it to the polymer model and solve it in the asymptotic limit are not the source of discrepancies between the predictions of the telegraph model and experimental distributions of the extensions of DNA during genome mapping. The agreement between theory and simulations motivates future work to determine the source of the remaining discrepancies between the predictions of the telegraph model and experimental distributions of the extensions of DNA in nanochannels used for genome mapping.

Item Dimension reduction and prediction in large p regressions (2009-05) Adragni, Kofi Placid
A high dimensional regression setting is considered with p predictors X = (X1, ..., Xp)^T and a response Y. The interest is with large p, possibly much larger than n, the number of observations. Three novel methodologies based on Principal Fitted Components models (PFC; Cook, 2007) are presented: (1) Screening by PFC (SPFC) for variable screening when p is excessively large, (2) Prediction by PFC (PPFC), and (3) Sparse PFC (SpPFC) for variable selection. SPFC uses a test statistic to detect all predictors marginally related to the outcome. We show that SPFC subsumes the Sure Independence Screening of Fan and Lv (2008). PPFC is a novel methodology for prediction in regression where p can be large or larger than n. PPFC assumes that X|Y has a normal distribution and applies to continuous response variables regardless of their distribution. It yields better accuracy in prediction than current leading methods. We adapt Sparse Principal Components Analysis (Zou et al., 2006) to the PFC model to develop SpPFC. SpPFC performs variable selection as well as forward linear model methods like the lasso (Tibshirani, 1996), but moreover, it encompasses cases where the distribution of Y|X is non-normal or the predictors and the response are not linearly related.
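
The SPFC screening step above is described as subsuming Sure Independence Screening, which ranks predictors by a marginal statistic. The sketch below shows generic marginal-correlation screening on simulated data; it is a stand-in for the general idea, not the PFC-based test statistic developed in the dissertation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression with p >> n; only the first 3 predictors matter.
n, p = 60, 1000
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.normal(size=n)

def marginal_screen(X, y, keep=20):
    """Rank predictors by absolute marginal correlation with y; keep the top few."""
    Xc = (X - X.mean(0)) / X.std(0)
    yc = (y - y.mean()) / y.std()
    score = np.abs(Xc.T @ yc) / len(y)   # |sample correlation| for each predictor
    return np.argsort(score)[::-1][:keep]

kept = marginal_screen(X, y)
print("top-ranked predictor indices:", sorted(kept[:10]))  # 0, 1, 2 should appear
```
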
Item Edge detection and image restoration of blurred noisy images using jump regression analysis (2013-08) Kang, Yicheng
We consider the problem of edge-preserving image restoration when images are degraded by spatial blur and pointwise noise. When the spatial blur described by a point spread function (psf) is not completely specified beforehand, this is a challenging "ill-posed" problem, because (i) theoretically, the true image cannot be uniquely determined by the observed image when the psf is unknown, even in cases when the observed image contains no noise, and (ii) practically, besides blurring, observed images often contain noise, which can cause numerical instability in many existing image deblurring procedures. In the literature, most existing deblurring procedures are developed under the assumption that the psf is completely specified, or that the psf follows a parametric form with one or more unknown parameters. In this dissertation, we propose blind image deblurring (BID) methodologies that do not require such restrictive conditions on the psf. They even allow the psf to change over location. This dissertation has four chapters. Chapter 1 introduces some motivating applications for image processing along with presenting the overall scope of the dissertation. In Chapter 2, the problem of step edge detection in blurred noisy images is studied. In Chapter 3, a BID procedure based on edge detection is proposed. In Chapter 4, an efficient BID procedure without explicitly detecting edges is presented. Both theoretical justifications and numerical studies show that our proposed procedures work well in applications.
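
As a point of reference for the edge detection problem described above, the following applies a conventional gradient-magnitude (Sobel) detector to a toy blurred, noisy image. The image, blur level, and threshold are assumptions made for illustration; the dissertation's jump-regression-based detector is considerably more refined than this baseline.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(3)

# Toy image: a bright square (step edges), degraded by Gaussian blur plus noise.
img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0
observed = ndimage.gaussian_filter(img, sigma=2.0) + rng.normal(scale=0.05, size=img.shape)

# Generic gradient-magnitude edge detector with a simple threshold.
gx = ndimage.sobel(observed, axis=0)
gy = ndimage.sobel(observed, axis=1)
grad_mag = np.hypot(gx, gy)
edges = grad_mag > 0.5 * grad_mag.max()
print("pixels flagged as edges:", int(edges.sum()))
```
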
Item Edge structure preserving 2-D and 3-D image denoising by jump surface estimation. (2011-08) Mukherjee, Partha Sarathi
Image denoising is often used for pre-processing images so that subsequent image analyses are more reliable. Many existing methods cannot preserve complicated edge-structures well, but those structures contain useful information about the image objects. So, besides noise removal, a good denoising method should preserve important edge-structures. The major goal of this dissertation is to develop image denoising techniques so that complicated edge-structures are preserved efficiently. The developed methods are based on nonparametric estimation of discontinuous surfaces, because a monochrome image can be regarded as a surface of the image intensity function and its discontinuities are usually at the outlines of the objects. The first part of this dissertation introduces some existing methods and related literature. Next, an edge-structure preserving 2-D image denoising technique is proposed, and it is shown that it performs well in many applications. The next part considers 3-D images. Because of the emerging popularity of 3-D MRI images, 3-D image denoising has become an important research area. The edge-surfaces in 3-D images can have much more complicated structures compared to the edge-curves in 2-D images. So, direct generalizations of 2-D methods would not be sufficient. This part handles the challenging task of mathematically describing different possible structures of the edge-surfaces in 3-D images. The proposed procedures are shown to outperform many popular methods. The next part deals with the well-known bias issue in denoising MRI images corrupted with Rician noise, and provides an efficient method to remove that bias. The final part of this dissertation discusses future research directions along the lines of the previous parts. One of them is image denoising by appropriate multilevel local smoothing techniques so that the fine details of the images are well preserved.

Item Forecast combination for outlier protection and forecast combination under heavy tailed errors (2014-11) Cheng, Gang
Forecast combination has been proven to be a very important technique for obtaining accurate predictions. Numerous forecast combination schemes with distinct properties have been proposed. However, to our knowledge, little has been discussed in the literature on combining forecasts with minimizing the occurrence of forecast outliers in mind. An unnoticed phenomenon is that robust combining, which often improves predictive accuracy (under square or absolute error loss) when innovation errors have a tail heavier than a normal distribution, may have a higher frequency of prediction outliers. Given the importance of reducing outlier forecasts, it is desirable to seek new loss functions to achieve both the usual accuracy and outlier protection simultaneously. In the second part of this dissertation, we propose a synthetic loss function and apply it to a general adaptive forecast combination scheme; theoretical and numeric results support the advantages of the new method in terms of providing combined forecasts with relatively fewer large forecast errors and comparable overall performance. For various reasons, in many applications forecast errors exhibit heavy tail behaviors. Unfortunately, to our knowledge, little has been done to deal with forecast combination in such situations. Familiar forecast combination methods such as the simple average, least squares regression, or those based on the variance-covariance of the forecasts may perform very poorly in such situations. In the third part of this dissertation, we propose two forecast combination methods to address the problem. One is specially proposed for situations in which the forecast errors are strongly believed to have heavy tails that can be modeled by a scaled Student's t-distribution; the other is designed for relatively more general situations when there is a lack of strong or consistent evidence on the tail behaviors of the forecast errors due to a shortage of data and/or an evolving data-generating process. Adaptive risk bounds of both methods are developed. Simulations and a real example show the excellent performance of the new methods.
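
The combination schemes that the abstract above uses as reference points, the simple average and variance-based weighting, can be sketched directly. The toy example below assumes three forecasters with heavy-tailed (Student's t) errors; it illustrates those baselines only, not the proposed synthetic-loss or t-modeled combination methods.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy setup: three forecasters of the same series, with heavy-tailed errors.
T = 300
truth = np.cumsum(rng.normal(size=T))
forecasts = truth + rng.standard_t(df=3, size=(3, T)) * np.array([[0.5], [1.0], [2.0]])

# Estimate weights on a training window, evaluate on the rest.
train, test = slice(0, 200), slice(200, T)
mse = ((forecasts[:, train] - truth[train]) ** 2).mean(axis=1)
w = (1 / mse) / (1 / mse).sum()              # inverse-MSE weights
combo_avg = forecasts[:, test].mean(axis=0)  # simple average
combo_w = w @ forecasts[:, test]             # weighted combination

for name, f in [("simple average", combo_avg), ("inverse-MSE weights", combo_w)]:
    err = f - truth[test]
    print(f"{name}: MSE={np.mean(err**2):.3f}, max |error|={np.max(np.abs(err)):.2f}")
```

The max-error column hints at the phenomenon the dissertation targets: a combination can have good average accuracy while still producing occasional large outlier forecasts.
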
Item Gender, Sex, and Sexuality in Secondary Statistics (2022-06) Parise, Megan
Statistics and data analysis have been part of the K-12 mathematics curriculum for the past few decades, and in conjunction with mathematics standards documents, the Guidelines for Assessment and Instruction in Statistics Education report clarified learning progressions for statistical content in K-12 mathematics (Franklin et al., 2007). Yet many secondary mathematics teachers struggle with teaching statistics because of its dependence on context and its use of variability (Cobb & Moore, 1997). Because of this struggle, secondary mathematics teachers who teach statistics may rely heavily on textbooks and pre-packaged curricula to drive their instruction. However, as I will demonstrate in the first paper of this dissertation, commercially published secondary statistics curricula in the United States often project a narrowed view of the world with respect to the types of contexts they use to develop statistical understanding. The first paper was the impetus for this three-paper series on statistics curricula. In this study, I used queer theory and critical mathematics education to examine the exercises, examples, and other text from three widely circulated statistics textbooks. I then applied critical discourse analysis to develop overarching themes related to the way in which the identities of gender, sex, and sexuality are developed through the sample textbooks. I found that, in addition to defining sex and gender as conflated and binary, the textbooks also construct identities in ways that maintain strict boundaries between women/females and men/males, and these boundaries uphold heteronormative ideologies. This paper has implications for textbook publishers, teachers, and researchers and has been published in a special issue on Gender in Mathematics in Mathematics Education Research Journal (Parise, 2021). Based on the findings from this paper, I wanted to explore how statistics students and teachers interacted with these identity constructions. Therefore, in the second paper, I examine how statistics students are implicated in telling a heteronormative narrative through statistics textbook word problems that use gender, sex, and sexuality as context. I draw from Gerofsky's (1996) research, which establishes mathematics word problems as a genre with specific story-like components. I then apply Wortham's (2003) work on discursive parallelism to demonstrate how the statistics student engaging in the problem is complicit in completing a heteronormative narrative to be academically successful. As the problem progresses, and the parallelism between the two students is solidified, the real student doing the problem merges with the fictitious student in the word problem. The real student confirms the stereotype that women only date men who are taller than they are and then removes an "abnormally" tall woman from the data set. This narrative is then reinforced by a statistical calculation, the correlation coefficient. This paper has implications for teachers who aim to counter the overwhelmingly heteronormative ideologies present in mathematics and statistics textbooks. The third paper builds on the first two by examining how statistics teachers enact curriculum and analyzing teachers' commitments and actions that disrupt heteronormative and gender/sex binary narratives in their curricular resources. For paper three, I review background literature on teachers' use of curriculum as well as on how statistics teachers committed to justice-oriented teaching use curricular materials to attend to social issues in their classrooms. As a theoretical lens, I employed Gutstein's (2006) teaching mathematics for social justice to create a justice-oriented statistics teaching framework. I interviewed Advanced Placement Statistics teachers who align their teaching philosophies toward justice-oriented statistics teaching and asked questions related to how they use or modify their curricular materials to address issues of sex, gender, and sexuality in class. I found that the type of curricular material mediated the teachers' perceived authority over modifying the resource, particularly when they use Advanced Placement practice items. Lastly, I discuss how secondary statistics teachers can encourage their students to apply a critical lens to Advanced Placement practice items in order to develop critical statistics literacy.

Item Geometric ergodicity of a random-walk Metropolis algorithm via variable transformation and computer aided reasoning in statistics. (2011-06) Johnson, Leif Thomas
With the steady increase of affordable computing, more and more often analysts are turning to computationally intensive techniques like Markov chain Monte Carlo (MCMC). To properly quantify the quality of their MCMC estimates, analysts need to know how quickly the Markov chain converges. Geometric ergodicity is a very useful benchmark for how quickly a Markov chain converges. If a Markov chain is geometrically ergodic, there are easy-to-use consistent estimators of the Monte Carlo standard errors (MCSEs) for the MCMC estimates, and an easy-to-use central limit theorem for the MCMC estimates. We provide a method for finding geometrically ergodic Markov chains for classes of target densities. This method uses variable transformation to induce a proxy target distribution, and a random-walk Metropolis algorithm for the proxy target distribution will be geometrically ergodic. Because the transformations we use are one-to-one, Markov chains generated for the proxy target density can be transformed back to a sample from the original target density without loss of inference. Furthermore, because the Markov chain for the proxy distribution is geometrically ergodic, the consistent MCSEs and the CLT apply to the sample on the scale of the original target density. We provide example applications of this technique to multinomial logit regression and a multivariate t distribution. Computer Aided Reasoning (CAR) uses a computer program to assist with mathematical reasoning. Proofs done in a proof assistant program are done formally; every step and inference is validated back to the axioms and rules of inference. Thus we have higher confidence in the correctness of a formally verified proof than in one done with the traditional paper-and-pencil technique. Computers can track many more details than can be done by hand, so more complicated proofs with more cases and details can be done in CAR than can be done by hand, and proof assistant programs can help fill in the gaps in a proof. We give a brief overview of the proof assistant program HOL Light, and use it to formally prove the Markov inequality with an expectation-based approach.
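
The variable-transformation idea in the Johnson (2011) item above (run a random-walk Metropolis chain on a transformed variable, then map the draws back through a one-to-one transformation) can be demonstrated on a toy target. The sketch assumes a Gamma(2, 1) target and a log transformation; it shows only the mechanics, including the Jacobian term, not the geometric ergodicity analysis itself.

```python
import numpy as np

rng = np.random.default_rng(5)

# Target on x > 0: Gamma(shape=2, rate=1), log-density up to a constant.
def log_target(x):
    return np.log(x) - x if x > 0 else -np.inf

# One-to-one transformation x = exp(z); the induced log-density adds the Jacobian term z.
def log_target_z(z):
    return log_target(np.exp(z)) + z

def rw_metropolis(logf, z0, n_iter=20000, step=1.0, rng=rng):
    z, out = z0, np.empty(n_iter)
    lf = logf(z)
    for i in range(n_iter):
        prop = z + step * rng.normal()
        lf_prop = logf(prop)
        if np.log(rng.uniform()) < lf_prop - lf:
            z, lf = prop, lf_prop
        out[i] = z
    return out

z_draws = rw_metropolis(log_target_z, z0=0.0)
x_draws = np.exp(z_draws)   # transform back; inference is on the original scale
print("sample mean of x (true value 2):", x_draws.mean().round(2))
```
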
Item Geometric ergodicity of Gibbs samplers. (2009-07) Johnson, Alicia A.
Due to a demand for reliable methods for exploring intractable probability distributions, the popularity of Markov chain Monte Carlo (MCMC) techniques continues to grow. In any MCMC analysis, the convergence rate of the associated Markov chain is of practical and theoretical importance. A geometrically ergodic chain converges to its target distribution at a geometric rate. In this dissertation, we establish verifiable conditions under which geometric ergodicity is guaranteed for Gibbs samplers in a general model setting. Further, we show that geometric ergodicity of the deterministic scan Gibbs sampler ensures geometric ergodicity of the Gibbs sampler under alternative scanning strategies. As an illustration, we consider Gibbs sampling for a popular Bayesian version of the general linear mixed model. In addition to ensuring the rapid convergence required for useful simulation, geometric ergodicity is a key sufficient condition for the existence of central limit theorems and consistent estimators of Monte Carlo standard errors. Thus our results allow practitioners to be as confident in inference drawn from Gibbs samplers as they would be in inference drawn from random samples from the target distribution.
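
The practical payoff described above, that geometric ergodicity licenses central limit theorems and consistent Monte Carlo standard errors, can be seen with a toy deterministic-scan Gibbs sampler. The bivariate normal target and batch-means estimator below are standard textbook illustrations and assumptions of this sketch, not examples taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(6)

# Deterministic-scan Gibbs sampler for a bivariate normal with correlation rho:
# X | Y=y ~ N(rho*y, 1 - rho^2) and Y | X=x ~ N(rho*x, 1 - rho^2).
def gibbs_bvn(rho=0.9, n_iter=50000, rng=rng):
    x = y = 0.0
    draws = np.empty((n_iter, 2))
    s = np.sqrt(1 - rho**2)
    for i in range(n_iter):
        x = rng.normal(rho * y, s)
        y = rng.normal(rho * x, s)
        draws[i] = x, y
    return draws

def batch_means_mcse(chain, n_batches=50):
    """Monte Carlo standard error via batch means; justified when a Markov chain CLT holds."""
    m = len(chain) // n_batches
    means = chain[: m * n_batches].reshape(n_batches, m).mean(axis=1)
    return np.sqrt(means.var(ddof=1) / n_batches)

draws = gibbs_bvn()
est = draws[:, 0].mean()
print(f"E[X] estimate: {est:.3f} +/- {2 * batch_means_mcse(draws[:, 0]):.3f}")
```
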
Item Graphical methods of determining predictor importance and effect (2008-08) Rendahl, Aaron Kjell
Many experiments and studies are designed to discover how a group of predictors affect a single response. For example, an agricultural scientist may perform an experiment to determine how rainfall, sunlight, and fertilizer affect plant growth. In situations like this, graphical methods to show how the various predictors affect the response and the relative importance of each predictor can be invaluable, not only in helping the researcher understand the results, but also in communicating the findings to non-specialists. For settings where a simple statistical model can be used to fit the data, several graphical methods for showing the effect of individual predictors already exist. However, few methods are available for more complex settings that require more complex models. A framework for understanding the existing methods is developed using Cook's net-effect plots, and a criterion for evaluating and creating methods is proposed. This criterion states that for a plot to be most useful in showing how a given predictor affects the response, the conditional distribution of the vertical axis given the horizontal axis should be independent of the other predictors. That is, the plot should not hide any additional information gained by knowing the other predictors. This proposed framework and criterion are used to develop graphical methods appropriate for use with more complex modeling algorithms. In particular, these plots have been explored in the context of model combining methods, and various versions compared and analyzed. Additionally, the weights from these model combining methods are used to modify existing methods of determining predictor importance values, resulting in improved values for spurious predictors.

Item Incident cataracts following protracted low-dose occupational ionizing radiation exposures in United States medical radiologic technologists: Statistical methods for exploring heterogeneity of effects and improving causal inference (2016-02) Meyer, Craig
Background: Medical radiologic technologists are routinely exposed to protracted low-dose occupational ionizing radiation. The U.S. Radiologic Technologists (USRT) Study was begun in 1982 by the National Cancer Institute in collaboration with the University of Minnesota School of Public Health and the American Registry of Radiologic Technologists to investigate potential health risks from occupational ionizing radiation. Ionizing radiation exposures have been associated with cataracts, which if left untreated can lead to visual impairment or blindness. Phenotypes of cataracts are characterized by their location in the eye lens and include posterior subcapsular and cortical cataracts (most commonly associated with ionizing radiation) and nuclear cataracts (associated with age). Methods that allow investigators to flexibly examine the extent of heterogeneity across many covariate strata are needed to help characterize the extent of any heterogeneity.
One such potential method is boosted regression trees, a machine learning ensemble model that is particularly well suited to prediction while incorporating interactions. As prediction is becoming increasingly important for epidemiologic investigations (causal inference methods commonly require the use of prediction), exploration of the utility of machine learning methods in epidemiology is warranted. Occupational epidemiologic cohort studies are often susceptible to selection bias from the healthy worker survivor effect (HWSE), whereby less healthy individuals leave work and accrue less exposure compared to healthier individuals who stay at work and continue to accrue exposure. As a result, the association between exposure and an outcome may be attenuated, or even reversed in some cases. G-methods are a family of analytical tools that were developed to address situations that may be affected by time-varying confounding and structural bias, as seen in the HWSE. One such method, the parametric g-formula, is a rigorous computational model that has been used to correct effect estimates for potential bias from the HWSE.
Objective: The overall objective of this research is to explore the relationship between protracted low-dose exposures to occupational ionizing radiation and the risk of cataracts in medical radiologic technologists in the United States and its territories, and to propose methodologic techniques to help estimate causal effects in such settings. The overall objective of this research will be accomplished in three separate manuscripts.
Manuscript 1: Aim: To estimate the overall association between protracted exposure to low-dose occupational ionizing radiation and incident cataracts in medical radiologic technologists. Methods: Cox regression models were used to model time to cataract as predicted by ionizing radiation. Technologists were followed from the year first worked as a radiologic technologist, starting at age 18 or older, until report of cataracts or administrative censoring at the third survey. Results: After adjustment for birth year, sex, and race/ethnicity (N=69,798), ionizing radiation was significantly associated with an increased hazard of cataracts, with a time-varying effect (p<0.001) that, while initially elevated, decreased over time. Hazard ratios of cataract per 10-mSv increment of radiation were statistically significant at age 20 [HR=1.09; 95% CI = (1.04, 1.14)] and age 30 [HR = 1.04; 95% CI = (1.00, 1.09)], but were not significant after age 30. Sensitivity analyses indicated strong evidence that selection bias from the HWSE was present and may have explained the time-varying effect. Additionally, a literature review found five population-based studies of cataract subtype prevalence over time and indicated that there was potential for misclassification of cataracts in the USRT study that may have biased effect estimates.
Manuscript 2: Aim: Use boosted regression trees to fully characterize the distribution of the effect of occupational ionizing radiation on cataracts in medical radiologic technologists. Methods: A boosted regression tree model was used to build a prediction model of cataracts. The cohort was restricted to those ages 24–44 at baseline (N=43,513). Predictions from the model were used to calculate risk differences of cataracts between high-dose (75th percentile of observed badge dose: 61.31 mSv) and low-dose (25th percentile of observed badge dose: 23.90 mSv) occupational ionizing radiation in strata of potential effect modifiers.
Results: Overall, there was a significant population average effect [RD=0.002; 95% CI = (0.002, 0.015)]. Additionally, subgroups were found with larger risks than the population average, including those born earliest; those with diabetes, macular degeneration, or glaucoma; or those who were overweight (BMI > 25) at baseline. Those who were youngest and those without macular degeneration conversely had lower risk differences compared to the average.
Manuscript 3: Aim: Use the parametric g-formula to adjust effect estimates for the healthy worker survivor effect in the estimated risk of incident radiogenic cataracts in medical radiologic technologists. Methods: The parametric g-formula was used to estimate cataract risks under different hypothetical scenarios limiting badge dose in five-year periods to the 80th percentile (badge dose ≤ 18.38 mSv), 60th percentile (badge dose ≤ 9.06 mSv), 40th percentile (badge dose ≤ 4.47 mSv), and 20th percentile (badge dose ≤ 2.08 mSv) of observed dose, and a 5-mSv reduction in dose estimates in each period over follow-up (N=69,798). Cumulative incidence risks and risks conditional on survival of cataracts from these treatment regimes were compared to the status quo (no intervention on dose) with risk differences and 95% confidence intervals. Substantively important differences in both cumulative incidence of cataracts and conditional risks of cataracts between the natural course and treatment regimes were found. There was evidence that decreasing the dose of radiation exposure could reduce the risk of cataracts, even at relatively early ages.
Conclusion: Overall, our results indicate that low-dose occupational ionizing radiation exposures elevate the risks of cataracts in medical radiologic technologists in the USRT Study, as our three manuscripts found significant associations between occupational ionizing radiation and cataract risks. Additionally, methods were proposed to explore heterogeneity of effects and improve the causal interpretation of effect estimates in the association between ionizing radiation and cataracts. Validation of cataracts is warranted, and future studies would benefit from information regarding phenotypes of cataracts.
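
Manuscript 2 above computes risk differences between high and low dose from a boosted-tree prediction model. The sketch below mimics that idea on simulated data with scikit-learn's gradient boosting classifier: fit the model, then compare the average predicted risk with everyone's dose set to a high versus a low value. The cohort, dose-response coefficients, and covariates are invented for illustration and are not the USRT data or analysis.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(7)

# Toy cohort: outcome risk depends on dose, age, and a comorbidity indicator.
n = 5000
dose = rng.gamma(shape=2.0, scale=20.0, size=n)   # mSv-like exposure
age = rng.uniform(24, 44, size=n)
comorbid = rng.binomial(1, 0.2, size=n)
logit = -4 + 0.02 * dose + 0.05 * (age - 24) + 0.8 * comorbid
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([dose, age, comorbid])
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Standardized risk difference: predicted risk with dose set high vs low for everyone.
def mean_risk(model, X, dose_value):
    Xc = X.copy()
    Xc[:, 0] = dose_value
    return model.predict_proba(Xc)[:, 1].mean()

lo, hi = np.percentile(dose, [25, 75])
rd = mean_risk(model, X, hi) - mean_risk(model, X, lo)
print(f"risk difference (75th vs 25th percentile dose): {rd:.4f}")
```

Repeating the same contrast within covariate strata (for example, by comorbidity status) is the simplest way to surface the kind of effect heterogeneity the manuscript explores.
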
Item Incorporation of Covariates in Bayesian Piecewise Growth Mixture Models (2022-12) Lamm, Rik
The Bayesian Covariate Influenced Piecewise Growth Mixture Model (CI-PGMM) is an extension of the Piecewise Growth Mixture Model (PGMM; Lock et al., 2018) with the incorporation of covariates. This was done by using a piecewise nonlinear trajectory over time, meaning that the slope has a point where the trajectory changes, called a knot. Additionally, the outcome data belong to two or more latent classes with their own mean trajectories, referred to as a mixture model. Covariates were incorporated into the model in two ways. The first is influencing the outcome variable directly, explaining additional random error variance. The second is the influence of the covariates on the class membership directly with the use of multinomial logistic regression. Both uses of covariates can potentially influence the class memberships and, along with that, the trajectories and locations of the knot(s). This additional explanation of class memberships and trajectories can provide information on how individuals change, who is likely to belong in certain unknown classes, and how these class memberships can affect when the rapid change of a knot will happen. The model is shown to be appropriate and effective using two steps. First, a real data application using the National Longitudinal Survey of Youth is used to show the motivation for the model. This dataset measures income over time each year for individuals following high school. Covariates of sex and dropout status were used in the class-predictive logistic regression model. This resulted in a two-class solution showing effective use of the covariates, with the logistic regression coefficients drastically affecting the class memberships. The second step is a simulation following the motivating real data application. Pilot studies were used to determine whether the model was suitable for a full simulation, using the coefficients from the real data example as a basis for the data generation. Four pilot studies were performed, and reasonable estimates were found for the full simulation. The conditions were set up with a two-class model: one class containing one knot, and the second class following a linear slope. Two class-predictive covariates and one outcome-predictive covariate were used. A full simulation with 200 generated datasets was performed, with the manipulated conditions being error variance, sample size, model type, and class probability, for a 3x3x3x2 design with 54 total conditions. Outcome measures of convergence, average relative bias, RMSE, and coverage rate were used to show the suitability of the model. The simulation showed that use of the CI-PGMM was stable and accurate under multiple conditions. Sample size and model type were the most impactful predictors of appropriate model use. All outcome measures were worse for the small sample sizes and became more accurate when the sample sizes were larger. Also, the simpler models showed less bias and better convergence. However, these differences are smaller when the sample size is sufficiently large. These findings were supported with multi-factor ANOVA comparing simulation conditions. Use of the CI-PGMM in the real data example and the full simulation allowed for incorporation of covariates when appropriate. I show that model complexity can lead to issues of lower convergence; thus the model should only be used when appropriate and the sample size is sufficiently large. When used, however, the model can shed light on associations between covariates, class memberships, and locations of knots that were previously unavailable.

Item Leveraging Summary Statistics and Integrative Analysis for Prediction and Inference in Genome-Wide Association Studies (2020-07) Pattee, Jack
Genome-wide association studies (GWASs) have attained substantial success in parsing the genetic etiology of complex traits. GWAS analyses have identified many genetic variants associated with various traits, and polygenic risk scores estimated from GWASs have been used to effectively predict certain clinical phenotypes. Despite these accomplishments, GWASs suffer from some pervasive issues with power and interpretability. To address these issues, we develop powerful and novel approaches for prediction and inference on genetic and genomic data. Our approaches focus on two key elements. First is the incorporation of additional sources of genetic and genomic data. A typical GWAS characterizes the genetic basis of a trait in terms of associations between the trait and a set of single nucleotide polymorphisms (SNPs). This approach can often be underpowered and difficult to understand biologically. We can often increase power and interpretability by effectively incorporating other sources of genetic and genomic data into the single SNP analysis structure. Second is the development of methods that are widely applicable in the context of summary statistics. Many published GWAS analyses do not provide so-called individual-level genetic and genomic data, and instead provide only summary statistic information. Given this, we want our methods to be flexible in the context of summary statistics without the need for individual-level information. We first develop a novel approach to integrating somatic and germline information from tumors to identify genes associated with lung cancer risk. We leverage this approach to discover potentially novel genes associated with lung cancer. We then investigate the problem of estimating powerful and parsimonious models for polygenic risk scores in the context of summary statistics. We develop a set of novel methods for model estimation, model selection, and the assessment of model performance, and demonstrate their beneficial properties in extensive simulation and in application to GWASs of lung cancer, blood lipid levels, and height. Lastly, we integrate our methods for polygenic risk score estimation into a two-sample, two-stage least squares analysis framework to identify potentially novel endophenotypes associated with increased risk of Alzheimer's disease. We demonstrate via simulation and real data application that our approach is powerful and effective.
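
A polygenic risk score of the kind discussed above is, at its simplest, a weighted sum of allele counts with weights taken from GWAS summary statistics. The sketch below builds such a score with a naive p-value threshold on simulated genotypes; the genotypes, effect sizes, and threshold are assumptions for illustration, and the dissertation's contribution is precisely the more careful model estimation and selection that replaces this naive step.

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy genotype matrix: n individuals x m SNPs coded 0/1/2 (allele counts).
n, m = 1000, 500
maf = rng.uniform(0.05, 0.5, size=m)
G = rng.binomial(2, maf, size=(n, m)).astype(float)

# Pretend these came from published GWAS summary statistics: per-SNP effect
# sizes and p-values (no individual-level data are needed to build the score).
beta_hat = rng.normal(0, 0.05, size=m)
p_values = rng.uniform(size=m)

def polygenic_risk_score(G, beta_hat, p_values, p_threshold=0.05):
    """Weighted allele-count score using only SNPs passing a p-value threshold."""
    keep = p_values < p_threshold
    return G[:, keep] @ beta_hat[keep]

prs = polygenic_risk_score(G, beta_hat, p_values)
print("PRS summary: mean %.3f, sd %.3f" % (prs.mean(), prs.std()))
```
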
Item Likelihood ratio tests for high-dimensional normal distributions. (2011-12) Yang, Fan
For a random sample of size n obtained from p-variate normal distributions, we consider the likelihood ratio tests (LRTs) for their means and covariance matrices. Most of these test statistics have been extensively studied in classical multivariate analysis, and their limiting distributions under the null hypothesis were proved to be chi-square under the assumption that n goes to infinity while p remains fixed. In our research, we consider the high-dimensional case where both p and n go to infinity and their ratio p/n converges to a constant y in (0, 1]. We prove that the likelihood ratio test statistics under this assumption converge in distribution to a normal random variable, and we also give the explicit forms of its mean and variance. We run a simulation study to show that the likelihood ratio test using this new central limit theorem outperforms the one using the traditional chi-square approximation for analyzing high-dimensional data.
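
The motivation for the high-dimensional central limit theorem above can be reproduced with a short simulation: compute the classical -2 log likelihood ratio statistic for testing an identity covariance matrix and compare its behavior with the fixed-p chi-square approximation when p/n is large. The statistic's form and the dimensions below are standard textbook choices, not values from the dissertation; the corrected normal limit itself, with its explicit centering and scaling, is what the thesis derives.

```python
import numpy as np

rng = np.random.default_rng(9)

def neg2_log_lrt_identity(X):
    """-2 log LRT for H0: Sigma = I in N_p(mu, Sigma): n * (tr(S) - log|S| - p)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                      # MLE of the covariance matrix
    sign, logdet = np.linalg.slogdet(S)
    return n * (np.trace(S) - logdet - p)

n, p = 100, 50                             # p/n = 0.5: a "high-dimensional" regime
reps = 500
stats_h0 = np.array([neg2_log_lrt_identity(rng.normal(size=(n, p))) for _ in range(reps)])

df = p * (p + 1) / 2                       # classical chi-square degrees of freedom
print("empirical mean of -2 log Lambda:", stats_h0.mean().round(1))
print("chi-square approximation mean  :", df)
# The gap between these two numbers illustrates why a normal limit with corrected
# centering and scaling is needed when p grows with n.
```
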