Browsing by Subject "statistics"
Now showing 1 - 6 of 6
Item: A combined statistical and machine learning approach for single channel speech enhancement (2015-05), Tseng, Hung-Wei

In this thesis, we study the single-channel speech enhancement problem, the goal of which is to recover the desired speech from a monaural noisy recording. Speech enhancement is a focal problem to study because of its widespread use in speech-related applications such as hearing aids, mobile communications, and speech recognition systems. Three speech enhancement algorithms are proposed. In the first algorithm, Wiener Non-negative Matrix Factorization (WNMF), we combine traditional Wiener filtering and NMF into a single optimization problem. The objective is to minimize the mean square error, as in Wiener filtering, while the constraints ensure that the enhanced speech is sparsely representable by the speech model learned by NMF. WNMF is novel in that it uses NMF to capture speech-specific structure and simultaneously leverages that structure to improve the Wiener filter. For the second algorithm, we propose a Sparse Gaussian Mixture Model (SGMM) that extends traditional NMF and the Gaussian model. SGMM captures the complex structure of speech better than traditional NMF does. To control the representational power of SGMM, we impose sparsity so that only a few Gaussian components are active at any time; computationally, this is achieved with an ℓ0-norm constraint in the maximum-likelihood (ML) estimation. The contribution of SGMM lies in solving the constrained ML estimation, which admits a closed-form update even under the non-convex, non-smooth ℓ0-norm constraint. The final algorithm is Sparse NMF + Deep Neural Network (SNMF-DNN), in which we treat speech enhancement as a supervised regression problem: the goal is to estimate the optimal enhancement gain. SNMF, originally designed for source separation, is used to extract features from the noisy recording, and a DNN is then trained to estimate the optimal enhancement gain. Although the system is simple and requires no sophisticated handcrafted features, it demonstrates substantial improvements in both the intelligibility and the quality of the enhanced speech.
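WNMF itself solves a single joint optimization, which is specific to the thesis. The Python sketch below instead shows the standard sequential NMF-plus-Wiener-gain recipe that it builds on: learn separate NMF dictionaries for speech and noise magnitudes, infer activations on the noisy mixture, and form a Wiener-style gain from the two reconstructions. This is a toy illustration, not the author's algorithm; the array shapes, component counts, and random "spectrograms" are all placeholders.

import numpy as np

rng = np.random.default_rng(0)

def nmf(V, k, iters=200):
    # Basic multiplicative-update NMF for V ~= W @ H (Euclidean loss).
    W = rng.random((V.shape[0], k)) + 1e-3
    H = rng.random((k, V.shape[1])) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

def infer_activations(V, W, iters=200):
    # Keep the dictionary W fixed and infer activations H only.
    H = rng.random((W.shape[1], V.shape[1])) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    return H

# Toy stand-ins for STFT magnitude spectrograms (frequency bins x frames).
speech_train = np.abs(rng.normal(size=(64, 400)))
noise_train = np.abs(rng.normal(size=(64, 400)))
noisy_mix = np.abs(rng.normal(size=(64, 100)))

W_speech, _ = nmf(speech_train, k=20)   # speech dictionary
W_noise, _ = nmf(noise_train, k=10)     # noise dictionary

W = np.hstack([W_speech, W_noise])
H = infer_activations(noisy_mix, W)
speech_hat = W_speech @ H[:20]
noise_hat = W_noise @ H[20:]

# Wiener-style gain: the fraction of each time-frequency bin credited to speech.
gain = speech_hat / (speech_hat + noise_hat + 1e-9)
enhanced = gain * noisy_mix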
Item: High Dimensional Tobit Regression (2023-07), Jacobson, Tate

High-dimensional regression and regression with a left-censored response are each well-studied topics. In spite of this, few methods have been proposed that deal with both complications simultaneously. To fill this gap, we develop extensions of the Tobit model, a standard method for censored regression in economics, for high-dimensional estimation and inference. We focus first on estimation, introducing several penalized Tobit estimators and developing a fast algorithm that combines quadratic majorization with coordinate descent to compute the penalized Tobit solution path. Theoretically, we analyze the Tobit lasso and Tobit with a folded concave penalty, bounding the ℓ2 estimation loss for the former and proving that a local linear approximation estimator for the latter possesses the strong oracle property in an ultra high-dimensional setting. In a thorough simulation study, we assess the prediction, estimation, and selection performance of our penalized Tobit models on high-dimensional left-censored data. We then shift our attention to inference. Few methods have been developed for conducting statistical inference in high-dimensional left-censored regression. Among the methods that do exist, none are flexible enough to test general linear hypotheses, that is, all hypotheses of the form H0: Cβ*_M = t. We fill this gap by developing partial penalized Wald, score, and likelihood ratio tests for testing general linear hypotheses in high-dimensional Tobit models. We derive approximate distributions for the partial penalized test statistics under the null hypothesis and under local alternatives in an ultra high-dimensional setting. We propose an alternating direction method of multipliers (ADMM) algorithm to compute the partial penalized test statistics. Through an extensive empirical study, we show that the partial penalized Tobit tests achieve their nominal size and are consistent in finite samples.
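To make the setup concrete, the Tobit model and a penalized estimator of the kind the abstract describes can be written as follows. This is the textbook left-censored formulation with a generic censoring point c and generic penalty p_λ, consistent with but not necessarily identical to the thesis's exact objective:

\[
y_i = \max(y_i^{*}, c), \qquad y_i^{*} = x_i^{\top}\beta^{*} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),
\]
\[
(\hat{\beta}, \hat{\sigma}) \in \arg\min_{\beta,\sigma}\; -\frac{1}{n}\sum_{i=1}^{n}\left[ d_i \log\!\left(\frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i^{\top}\beta}{\sigma}\right)\right) + (1 - d_i)\,\log \Phi\!\left(\frac{c - x_i^{\top}\beta}{\sigma}\right) \right] + \sum_{j=1}^{p} p_{\lambda}(|\beta_j|),
\]

where d_i = 1{y_i > c} indicates an uncensored observation, φ and Φ are the standard normal density and distribution function, and p_λ is the lasso penalty λ|β_j| or a folded concave penalty such as SCAD. In the inference half, C is a fixed matrix, M indexes the coefficients under test, and H0: Cβ*_M = t includes single-coefficient, equality, and sum constraints as special cases.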
Item: Integrating Human and Machine Intelligence in Galaxy Morphology Classification Tasks (2018-01), Beck, Melanie

The large flood of data flowing from observatories presents significant challenges to astronomy and cosmology, challenges that will only be magnified by projects currently under development. Growth in both the volume and the velocity of astrophysics data is accelerating: whereas the Sloan Digital Sky Survey (SDSS) produced 60 terabytes of data in the last decade, the upcoming Large Synoptic Survey Telescope (LSST) plans to register 30 terabytes per night starting in the year 2020. Additionally, the Euclid Mission will acquire imaging for ∼5 × 10^7 resolvable galaxies. The field of galaxy evolution faces a particularly challenging future, as complete understanding often cannot be reached without analysis of detailed morphological galaxy features. Historically, morphological analysis has relied on visual classification by astronomers, drawing on the human brain's capacity for advanced pattern recognition. However, this accurate but inefficient method falters when confronted with many thousands (or millions) of images. In the SDSS era, efforts to automate morphological classification of galaxies (e.g., Conselice et al., 2000; Lotz et al., 2004) have been reasonably successful, distinguishing elliptical from disk-dominated galaxies with accuracies of ∼80%. While this is statistically very useful, a key problem with these methods is that they often cannot say which 80% of their classifications are accurate. Furthermore, when confronted with the more complex task of identifying key substructure within galaxies, automated classification algorithms begin to fail. The Galaxy Zoo project takes a highly innovative approach to the scalability problem of visual classification. Displaying images of SDSS galaxies to volunteers via a simple and engaging web interface, www.galaxyzoo.org asks people to classify images by eye. Within the first year, hundreds of thousands of members of the general public had classified each of the ∼1 million SDSS galaxies an average of 40 times. Galaxy Zoo thus addressed the time inefficiency of visual classification and improved its accuracy by producing a distribution of independent classifications for each galaxy. While crowd-sourced galaxy classifications have proven their worth, challenges remain before this method can become a critical and standard component of the data processing pipelines for the next generation of surveys. In particular, though innovative, crowd-sourcing techniques do not have the capacity to handle the data volume and rates expected of those surveys; automated algorithms will instead be delegated the majority of classification tasks, freeing citizen scientists to contribute their efforts to subtler and more complex assignments. This thesis presents a solution through an integration of visual and automated classifications, preserving the best features of both human and machine. We demonstrate the effectiveness of such a system through a re-analysis of visual galaxy morphology classifications collected during the Galaxy Zoo 2 (GZ2) project. We reprocess the top-level question of the GZ2 decision tree with a Bayesian classification aggregation algorithm dubbed SWAP, originally developed for the Space Warps gravitational lens project. Through a simple binary classification scheme, we increase the classification rate nearly five-fold, classifying 226,124 galaxies in 92 days of GZ2 project time while reproducing labels derived from GZ2 classification data with 95.7% accuracy. We next combine this with a Random Forest machine learning algorithm that learns on a suite of non-parametric morphology indicators widely used for automated morphologies. We develop a decision engine that delegates tasks between human and machine and demonstrate that the combined system provides a factor of 11.4 increase in the classification rate, classifying 210,803 galaxies in just 32 days of GZ2 project time with 93.1% accuracy. Because the Random Forest algorithm incurs minimal computational cost, this result has important implications for galaxy morphology identification tasks in the era of Euclid and other large-scale surveys.
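The core of SWAP is a running Bayesian update of each subject's probability of belonging to a class, with every volunteer vote weighted by that volunteer's estimated skill. The Python sketch below is a minimal sketch of that idea for a binary "featured vs. smooth" task; the production SWAP agent model, priors, and retirement logic are more elaborate, and the skill values and votes here are made up for illustration.

def swap_update(prior, vote, p_yes_given_yes, p_no_given_no):
    # One Bayesian update of P(subject is 'featured') after a single vote.
    # p_yes_given_yes: volunteer's estimated P(votes 'featured' | truly featured)
    # p_no_given_no:   volunteer's estimated P(votes 'smooth'   | truly smooth)
    if vote == "featured":
        like_yes, like_no = p_yes_given_yes, 1.0 - p_no_given_no
    else:
        like_yes, like_no = 1.0 - p_yes_given_yes, p_no_given_no
    numerator = like_yes * prior
    return numerator / (numerator + like_no * (1.0 - prior))

p = 0.5  # uninformative prior for a new subject
votes = [("featured", 0.80, 0.70),  # (vote, skill on 'featured', skill on 'smooth')
         ("featured", 0.60, 0.90),
         ("smooth",   0.75, 0.75)]
for vote, skill_yes, skill_no in votes:
    p = swap_update(p, vote, skill_yes, skill_no)

# A subject is retired once p crosses a high or low confidence threshold,
# so confidently classified galaxies stop consuming volunteer effort.
print(f"P(featured) = {p:.3f}")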
Item: Residuals and Influence in Regression (New York: Chapman and Hall, 1982), Cook, R. Dennis; Weisberg, Sanford

Item: Social cohesion or 'myth of oneness'?: Implications of the ban on ethnicity statistics in Fiji (2024-05-01), Nailatikau, Merewalesi

Race and ethnicity have played significant roles in Fiji's political landscape since the country gained independence in 1970. Although 'race' and 'ethnicity' are distinct concepts, the terms are often used interchangeably in Fijian nomenclature, particularly concerning relations between indigenous Fijians and Indo-Fijians. The Bainimarama regime, following the 2006 military coup, implemented policies that erased ethnic identifiers and mandated 'Fijian' as the label for all citizens, while prohibiting the publication of racially disaggregated statistics under the guise of combating racism. This move hindered understanding of how poverty is experienced by different communities. The government newly elected in 2022 has lifted these restrictions, focusing on economic recovery through a consultative multi-sectoral approach. This paper examines Fiji census data and government addresses to explore the implications of the 16-year ban on publishing ethnically disaggregated statistics for collective memory and data equity. Despite efforts to shape a master narrative, the ban has hindered progress toward racial equity and obscured emerging inequality hotspots. Recommendations include advancing an integrated national data system, incorporating data in truth and reconciliation processes, establishing institutional norms to prevent abuse of power, and fostering social cohesion through consensus-building that acknowledges diverse perspectives.

Item: Talking in Code: Code Review as a Form of Communication (2023), Lisinker, Regina

As coding and computation increasingly permeate statistics and data science courses, it is important for students to learn not only coding syntax but also how to communicate their work. The process of code review enhances team communication by establishing a consistent feedback loop between the coder and reviewer(s). While code review is commonplace in industry, it is rarely implemented in data science classrooms. For this study, teams of undergraduate data science majors partnered with local community organizations to work on a data-focused problem. Students were given code review resources to use during the latter half of their projects. Data were collected by surveying students and interviewing their faculty advisors after project completion. This thesis presents results from these data, describing how students used the materials, how they structured their code review processes, and how they communicated via code review.