The growing disparity between data set sizes and the amount of fast internal memory available in modern computer systems is an important challenge facing a variety of application domains. This problem is partly due to the incredible rate at which data is being collected, and partly due to the trend toward systems with increasing processor counts but without proportionate increases in fast internal memory. Without access to sufficiently large machines, many application users must trade off between utilizing the processing capabilities of their systems and performing their computations in memory. In this thesis we explore several approaches to solving this problem.

We develop effective and efficient algorithms for compressing scientific simulation data computed on structured and unstructured grids. We propose a paradigm for lossy compression of this data in which the data computed on the grid is modeled as a graph, which is decomposed into sets of vertices that satisfy a user-defined error constraint, epsilon. Each set of vertices is replaced by a constant value such that the reconstruction error is bounded by epsilon. We conduct a comprehensive set of experiments comparing these algorithms with other state-of-the-art scientific data compression methods. Over our benchmark suite, our methods compressed the data to 1% of its original size with an average PSNR of 43.00, and to 3% of its original size with an average PSNR of 63.30. In addition, our schemes outperform other state-of-the-art lossy compression approaches, requiring on average 25% of the space those approaches need for similar or better PSNR levels.

We present algorithms and experimental analysis for five data structures for representing dynamic sparse graphs. The goal of the presented data structures is twofold. First, the data structures must be compact, as the graphs being operated on continue to grow to less manageable sizes.
Second, the cost of operating on the data structures must be within a small factor of the cost of operating on the static graph; otherwise, these data structures will not be useful. Of these five data structures, three are approaches, one is semi-compact but suited for fast operation, and one is focused on compactness and is a dynamic extension of an existing technique known as the WebGraph Framework. Our results show that for well-intervalized graphs, such as web graphs, the semi-compact data structure is superior to all other data structures in terms of memory and access time. Furthermore, we show that in terms of memory, the compact data structure outperforms all other data structures at the cost of a modest increase in update and access time.

We present a virtual memory subsystem that we implemented as part of the BDMPI runtime. Our new virtual memory subsystem, which we call SBMA, bypasses the operating system's virtual memory manager to take advantage of BDMPI's node-level cooperative multitasking. Benchmarking using a synthetic application shows that, for the use cases relevant to BDMPI, the overhead incurred by the BDMPI-SBMA system is amortized such that it performs as fast as explicit data movement by the application developer. Furthermore, we tested SBMA with three different classes of applications, and our results show that, with no modification to the original program, speedups of 2x to 12x over a standard BDMPI implementation can be achieved for the included applications.

We present a runtime system designed to be used alongside data-parallel OpenMP programs for shared-memory problems requiring out-of-core execution. Our new runtime system, which we call OpenOOC, exploits the concurrency exposed by the OpenMP semantics to switch execution contexts during non-resident memory accesses and perform useful computation, instead of leaving the thread idle while it waits.
Benchmarking using a synthetic application shows that modern operating systems support the necessary memory and execution context switching functionality with high enough performance that it can be used to effectively hide some of the overhead incurred when swapping data between memory and disk in out-of-core execution environments. Furthermore, we tested OpenOOC with a practical computational application, and our results show that, with no structural modification to the original program, runtime can be reduced by an average of 21% compared with the out-of-core equivalent of the application.
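The lossy compression paradigm described earlier (model the grid data as a graph, decompose it into vertex sets that satisfy the error constraint epsilon, and replace each set by one constant) can be illustrated with a minimal sketch. It assumes a simple greedy region-growing strategy, which is a hypothetical stand-in and not necessarily one of the thesis's algorithms: a connected set is grown breadth-first only while the spread of its values stays within 2 * epsilon, so the midpoint of the set is within epsilon of every member. The names `compress` and `decompress` are illustrative.

```python
from collections import deque


def compress(values, adjacency, eps):
    """Hypothetical greedy decomposition of a grid graph into vertex sets.

    Each set is connected and its values span at most 2 * eps, so replacing
    every member by the set's midpoint keeps the error within eps.
    Returns (labels, constants): the set id of each vertex and one constant per set.
    """
    n = len(values)
    labels = [-1] * n          # -1 means "not yet assigned to a set"
    constants = []
    for seed in range(n):
        if labels[seed] != -1:
            continue
        set_id = len(constants)
        lo = hi = values[seed]
        labels[seed] = set_id
        queue = deque([seed])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if labels[v] != -1:
                    continue
                new_lo, new_hi = min(lo, values[v]), max(hi, values[v])
                # Accept the neighbor only if the set still satisfies the bound.
                if new_hi - new_lo <= 2 * eps:
                    lo, hi = new_lo, new_hi
                    labels[v] = set_id
                    queue.append(v)
        constants.append((lo + hi) / 2)  # midpoint is within eps of all members
    return labels, constants


def decompress(labels, constants):
    """Reconstruct one value per vertex from its set's constant."""
    return [constants[s] for s in labels]


if __name__ == "__main__":
    # A 1-D structured grid (a path graph) with six vertices.
    vals = [0.0, 0.1, 0.15, 1.0, 1.05, 3.0]
    adj = [[j for j in (i - 1, i + 1) if 0 <= j < len(vals)] for i in range(len(vals))]
    lab, consts = compress(vals, adj, eps=0.1)
    print(lab, consts, decompress(lab, consts))
```

Storing one label per vertex, as done here for clarity, does not by itself save space; a practical scheme would also need a compact encoding of the sets.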
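The WebGraph Framework mentioned above compresses adjacency lists using, among other techniques, intervals of consecutive successor ids and gap encoding of the remaining residuals, which is why "well-intervalized" graphs such as web graphs compress well. The sketch below illustrates only those two ideas on a single sorted successor list. The function names and the `min_interval` threshold are hypothetical, and the first residual gap is taken relative to 0 as a simplification. This is not the thesis's data structure or WebGraph's actual bit-level format.

```python
def encode_successors(succ, min_interval=3):
    """Split a sorted successor list into intervals and gap-encoded residuals.

    Runs of at least `min_interval` consecutive ids become (start, length)
    intervals; all other ids are residuals stored as gaps between neighbors.
    """
    intervals, residuals = [], []
    i, n = 0, len(succ)
    while i < n:
        j = i
        while j + 1 < n and succ[j + 1] == succ[j] + 1:
            j += 1                      # extend the run of consecutive ids
        run = j - i + 1
        if run >= min_interval:
            intervals.append((succ[i], run))
        else:
            residuals.extend(succ[i:j + 1])
        i = j + 1
    # Small gaps between sorted residuals are cheap to store with variable-length codes.
    gaps = [r - prev for prev, r in zip([0] + residuals, residuals)]
    return intervals, gaps


def decode_successors(intervals, gaps):
    """Rebuild the sorted successor list from intervals and gaps."""
    out = [start + k for start, length in intervals for k in range(length)]
    acc = 0
    for g in gaps:
        acc += g
        out.append(acc)
    return sorted(out)


if __name__ == "__main__":
    succ = [2, 5, 6, 7, 8, 12, 13, 20]
    iv, gp = encode_successors(succ)
    print(iv, gp, decode_successors(iv, gp))
```

For this list, the run 5 to 8 collapses into one interval `(5, 4)` and the residuals 2, 12, 13, 20 become the gaps 2, 10, 1, 7.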