Using a Low-Memory Factored Representation to Data Mine Large Data Sets

2006-05-03

Type

Report

Abstract

As data sets continue to grow in size, mining them efficiently becomes more difficult. A data set may be dynamic, or static but very large. The data may arrive in a stream, so that the total number of data points is either finite but unknown, or infinite. Many data mining methods require multiple passes over the data, which means the data must reside in memory to keep costs low. One alternative to mining the original data is to create a representation of the data set that occupies less memory while permitting the incorporation of new data. To this end, we present the Low-Memory Factored Representation (LMFR). The LMFR is an alternate representation of the data that uses less memory while maintaining a unique representation for each data point. Because the LMFR provides a general representation of the data, the data mining task does not need to be determined before the LMFR is constructed. For streaming data, the LMFR allows many more data points to be exposed to a given method at once. Furthermore, the LMFR can be recomputed on the fly to lower its memory footprint without re-examining the original data. We demonstrate experimentally that the LMFR can be an efficient replacement for the original data when clustering using Principal Direction Divisive Partitioning (PDDP). The resulting clustering method, Piecemeal PDDP (PMPDDP), maintains the scalability and quality of a PDDP clustering while extending PDDP to much larger data sets. We also demonstrate experimentally that the LMFR can be used for document retrieval. A given LMFR can achieve retrieval accuracy superior to that of a truncated singular value decomposition of a given rank, while being less expensive to compute and occupying less memory. Finally, we develop an application of the LMFR to streaming data mining, and show experimentally the cost of various techniques that lower the memory footprint of the LMFR. The overall results indicate that the LMFR has many possible applications in general data mining tasks.
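
For readers unfamiliar with PDDP, the clustering method named in the abstract, the following is a minimal sketch of a single PDDP split as it is commonly described: center the data, find the leading principal direction, and divide the points by the sign of their projection onto it. This is an illustration under stated assumptions, not the report's implementation and not the LMFR itself. The function names `principal_direction` and `pddp_split` are hypothetical, data points are stored as rows, and the principal direction is approximated by power iteration from a fixed start vector.

```python
import math


def principal_direction(rows, iterations=200):
    """Approximate the leading principal direction of the mean-centered rows.

    Uses power iteration on A^T A, where A is the centered data matrix.
    The fixed start vector is an assumption made for reproducibility; it
    fails if it happens to be orthogonal to the principal direction.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        # proj = A v, then w = A^T proj
        proj = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all points identical, or start vector was orthogonal
        v = [x / norm for x in w]
    return mean, v


def pddp_split(rows):
    """Split row indices in two by the sign of the principal projection."""
    mean, v = principal_direction(rows)
    left, right = [], []
    for i, r in enumerate(rows):
        score = sum((r[j] - mean[j]) * v[j] for j in range(len(v)))
        (left if score <= 0 else right).append(i)
    return left, right


if __name__ == "__main__":
    points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    print(pddp_split(points))
```

Applying such a split repeatedly to the cluster chosen at each step yields a binary tree of clusters. A production version would more likely use a sparse or truncated SVD routine than hand-written power iteration.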

Suggested citation

Littau, David. (2006). Using a Low-Memory Factored Representation to Data Mine Large Data Sets. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215700.
