Probabilistic models have a rich history in machine learning, offering a theoretical and practical framework for learning from observed data. Probabilistic models describe relationships between observed data and latent variables in terms of probability distributions. Practitioners in many fields of science have long been attracted to probabilistic methods as a way to quantify uncertainty in predictions and models, query models via inference, and estimating latent variables. In this thesis, Advancing Probabilistic Models for Approximate and Exact Inference, we connect foundational ideas in machine learning, such as probabilistic inference and ensemble learning, with deep learning. More specifically, the focus lies on the design of generative models with likelihood-based objective functions, which offer a solution for overcoming many broader challenges in machine learning---namely, explaining all of the data, efficient data usage, and quantifying uncertainty. For over two decades graphical models were the predominant paradigm for composing probabilistic models in machine learning. By composing probability distributions as building blocks for larger models, graphical models offer a comprehensible model-building framework that can be tailored to the structure of the data. As a build up to further work, we introduce a novel probabilistic graphical model for analyzing text datasets of multiple authors writing over time. In the era of big data, however, it is necessary to scale such models to large datasets. To that end, we propose an efficient learning algorithm that allows for training and probabilistic inference on text datasets with billions of words with general-purpose computing hardware. Recently, breakthroughs in deep learning have ushered in an explosion of new successes in probabilistic modeling, with models capable of modeling enormous collections of complex data and generating novel yet plausible data (e.g. new images, text, and speech). One promising direction in likelihood-based probabilistic deep learning is normalizing flows. Normalizing flows use invertible transformations to translate between simple and complex distributions, which allows for exact likelihood calculation and efficient sampling. In order to remain invertible and provide exact likelihood calculations, normalizing flows must be composed of differentiable bijective functions. However, bijections require that the inputs and outputs have the same dimensionality---which can pose signiﬁcant architectural, memory, and computational costs for high-dimensional data. We introduce Compressive Normalizing Flows that are, in the simplest case, equivalent to the probabilistic principal components analysis (PPCA). The PPCA-based compressive flow relaxes the bijective constraints and allows the model to learn a compressed latent representation, while offering parameter updates that are available analytically. Drawing on the connection between PPCA and Variational Autoencoders (VAE)---a powerful deep generative model, we extend our framework to VAE-based compressive flows for greater flexibility and scalability. Up until now the trend in normalizing flow literature has been to devise deeper, more complex transformations to achieve greater flexibility. We propose an alternative: Gradient Boosted Normalizing Flows (GBNF) model a complex density by successively adding new normalizing flow components via gradient boosting so that each new component is fit to the residuals of the previously trained components. Because each flow component is itself a density estimator, the aggregate GBNF model is structured like a mixture model. Moreover, GBNFs offer a wider, as opposed to strictly deeper, approach that improves existing NFs at the cost of additional training---not more complex transformations. Lastly, we extend normalizing flows beyond their original unsupervised formulation, and present an approach for learning high-dimensional distributions conditioned on low-dimensional samples. In the context of image modeling, this is equivalent to image super-resolution---the task of mapping a low-resolution (LR) image to a single high-resolution (HR) image. Super-resolution, however, is an ill-posed problem since there are infinitely many HR samples that are compatible with a given LR sample. Approaching super-resolution with likelihood-based models, like normalizing flows, allows us to learn a distribution over all possible HR samples. We present Probabilistic Super-Resolution (PSR) using Normalizing Flows for learning conditional distributions as well as joint PSR where the high- and low-dimensional distributions are modeled simultaneously. However, our approach is not solely for image modeling. Any dataset can be formulated for super-resolution, and using a PSR architecture alleviates challenges commonly associated with normalizing flows such---as the information bottleneck problem, and the inductive biases towards modeling local correlations.