Title: An Efficient, Scalable, Parallel Classifier for Data Mining
Authors: Srivastava, Anurag; Singh, Vineet; Han, Eui-Hong; Kumar, Vipin
Date issued: 1997
Date available: 2020-09-02
URI: https://hdl.handle.net/11299/215295
Type: Report
Language: en-US
Keywords: data mining; parallel processing; classification; scalability; decision trees

Abstract: Classification is an important data mining problem. Recently, there has been significant interest in classification using training datasets that are large enough that they do not fit in main memory and must be disk-resident. Although training data can be reduced by sampling, it has been shown that using the entire training dataset can be advantageous, since doing so can increase accuracy. Most current algorithms are unsuitable for large disk-resident datasets because their space and time complexities (including I/O) are prohibitive. A recent algorithm called SPRINT promises to alleviate some of the data size restrictions. We present a new algorithm called SPEC that provides similar accuracy, reduces I/O, reduces memory requirements, and improves scalability (in both time and space) on both sequential and parallel computers. We provide theoretical results as well as experimental results on the IBM SP2.
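
To give a sense of the kind of computation SPRINT-style classifiers perform, the sketch below evaluates a binary split on one numeric attribute by Gini impurity, the core per-node operation in such decision tree builders. This is an illustrative, in-memory sketch only, not the paper's SPEC implementation (which operates on sorted, disk-resident attribute lists); all function and variable names here are hypothetical.

```python
# Hedged sketch: Gini-based split evaluation for one numeric attribute,
# as done per tree node in SPRINT-style decision tree classifiers.
# Not the paper's SPEC algorithm; an in-memory illustration only.
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best binary split
    on one numeric attribute, scanning candidates in sorted order."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_thresh, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold lies between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Impurity of the split, weighted by partition sizes
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_thresh = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_thresh, best_score

# Example: a hypothetical "age" attribute with binary class labels
ages = [23, 45, 37, 52, 29, 61]
labels = ["no", "yes", "no", "yes", "no", "yes"]
thresh, score = best_split(ages, labels)  # → (41.0, 0.0): a pure split
```

Disk-resident algorithms such as SPRINT avoid materializing the label partitions as above; instead they keep per-attribute sorted lists on disk and maintain running class histograms during a single sequential scan, which is what keeps the I/O cost manageable.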