Author: Asamoah Owusu, Dennis
Date accessioned: 2018-09-21
Date available: 2018-09-21
Date issued: 2018-06
URI: https://hdl.handle.net/11299/200159
Description: University of Minnesota M.S. thesis. June 2018. Major: Computer Science. Advisor: Peter Peterson. 1 computer file (PDF); viii, 94 pages.
Abstract: There are times when it is helpful to know whether data is compressible before expending computational resources to compress it. The standard deviation of the byte distribution of data is an example of a measure of compressibility that does not require actually compressing the data. This work considered five such measures of compressibility: byte standard deviation, Shannon entropy, “average meaning entropy”, “byte counting”, and the “heuristic method”. We developed models that relate the output of these measures to the compression ratios of gzip, lz4, and xz, using data retrieved from browsing Facebook, Wikipedia, and YouTube. The models for byte standard deviation, Shannon entropy, and “average meaning entropy” were linear in both the parameters and the variables. The model for “byte counting” was non-linear in the predictor variable but linear in the parameters. The “heuristic method” was a classification model. In general, there was a strong relationship between the measures and the compressibility of a given piece of data. Also, in many cases a model developed using one set of data from a source (like YouTube) was able to estimate the compressibility of another data set from the same source to a useful extent. This suggests the potential for developing a model per efficient compressibility estimator (ECE) for a source that can predict, to a useful degree, the compressibility of data from that source. At the same time, the differences in accuracy when models were evaluated on the data they were developed from versus on new data from the same source indicate that there are important differences in the nature of the data coming from even the same source.
Language: en
Keywords: byte; compression; entropy; model; prediction; regression
Title: Modeling Outputs of Efficient Compressibility Estimators
Type: Thesis or Dissertation
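The thesis itself defines these measures precisely; as a rough illustration of the idea only, the sketch below computes two of them (byte standard deviation and Shannon entropy) for a byte string and compares them against an actual compression ratio. The interpretation of "byte standard deviation" as the standard deviation of the 256-bin byte-frequency histogram is an assumption, and zlib is used purely for convenience here (the thesis models gzip, lz4, and xz).

```python
import math
import os
import zlib
from collections import Counter

def byte_measures(data: bytes):
    """Two compression-free compressibility measures for a byte string:
    the standard deviation of its byte-frequency histogram (assumed
    interpretation) and its Shannon entropy in bits per byte."""
    n = len(data)
    counts = Counter(data)
    # Frequencies over all 256 possible byte values; absent bytes count as 0.
    freqs = [counts.get(b, 0) for b in range(256)]
    mean = n / 256
    std_dev = math.sqrt(sum((f - mean) ** 2 for f in freqs) / 256)
    # Shannon entropy: -sum p * log2(p) over observed byte probabilities.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return std_dev, entropy

if __name__ == "__main__":
    # Highly compressible (all zeros) vs. essentially incompressible (random).
    for label, sample in [("zeros", bytes(4096)), ("random", os.urandom(4096))]:
        std, ent = byte_measures(sample)
        ratio = len(sample) / len(zlib.compress(sample))
        print(f"{label}: std_dev={std:.1f}, entropy={ent:.3f} bits/byte, "
              f"zlib ratio={ratio:.2f}")
```

On typical inputs, low entropy and a high byte standard deviation coincide with a high compression ratio, which is the kind of relationship the thesis models with regression.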