A text preprocessing algorithm was developed that improved the compression ratio of a standard compression tool, bzip2. During preprocessing, characters in the original text files were replaced with a few special characters so that the Burrows-Wheeler Transform stage of bzip2 performed better. These special characters made the words less recognizable, but the words remained recoverable by exploiting the semantic relations between words in the text. The recovery process used a static English dictionary and a pretrained static neural network, Word2vec. Experiments showed that this method improved the compression ratio by an average of 2.9% for text files larger than 100 KB.
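The abstract does not give the algorithm's details, but the core idea of masking predictable characters can be illustrated with a minimal sketch. The dictionary, the marker character `\x01`, the 4-character prefix length, and the helper names below are all illustrative assumptions, not the authors' method: when a word's tail is uniquely determined by its prefix under a fixed dictionary, the tail is replaced by a single marker, and decompression reverses the substitution losslessly.

```python
# Illustrative sketch only: masking the predictable tail of a word when a
# fixed dictionary makes it uniquely recoverable. The dictionary, marker,
# and prefix length are arbitrary assumptions for demonstration.

DICTIONARY = {"compression", "compressor", "algorithm", "transform"}
MARKER = "\x01"  # assumed placeholder for the masked, predictable tail
KEEP = 4         # assumed number of leading characters kept visible

def unique_completion(prefix):
    """Return the single dictionary word starting with `prefix`, else None."""
    matches = [w for w in DICTIONARY if w.startswith(prefix)]
    return matches[0] if len(matches) == 1 else None

def mask_word(word):
    """Replace the tail of `word` with MARKER if the prefix determines it."""
    if len(word) > KEEP and unique_completion(word[:KEEP]) == word:
        return word[:KEEP] + MARKER
    return word

def unmask_word(token):
    """Invert mask_word using the same dictionary."""
    if token.endswith(MARKER):
        return unique_completion(token[:-1])
    return token

text = "algorithm transform algorithm transform"
masked = " ".join(mask_word(w) for w in text.split())
restored = " ".join(unmask_word(t) for t in masked.split())
assert restored == text  # the masking is lossless under this dictionary
```

Replacing many distinct tails with the same marker character increases the repetitiveness of the input, which is the kind of regularity the Burrows-Wheeler Transform in bzip2 exploits; the paper's method additionally uses Word2vec so that words can be recovered from context rather than only from a unique dictionary prefix.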
This research was supported by the Undergraduate Research Opportunities Program (UROP).
Trinh, Nam H.
Improving compression ratio on human-readable text by masking predictable characters.
Retrieved from the University of Minnesota Digital Conservancy,
Content distributed via the University of Minnesota's Digital Conservancy may be subject to additional license and use restrictions applied by the depositor.