Toward Automating and Systematizing the Use of Domain Knowledge in Feature Selection

Constructing prediction models for real-world domains often involves practical complexities that must be addressed to achieve good prediction results. Often, there are too many sources of data (features). Limiting the set of features in the prediction model is essential for good performance, but prediction accuracy may be degraded by the inadvertent removal of relevant features. The problem is even more acute in situations where the number of training instances is limited, as limited sample size and domain complexity are often attributes of real-world problems. This thesis explores the practical challenges of building regression models in large multivariate time-series domains with known relationships between variables. Further, we explore the conventional wisdom related to preparing datasets for model calibration in machine learning, and discuss best practices for learning time-varying concepts from data. The core contribution of this work is a novel wrapper-based feature selection framework called Developer-Guided Feature Selection (DGFS). It systematically incorporates domain knowledge for domains characterized by a large number of observable features. The observable features may be related to each other by logical, temporal, or spatial relationships, some of which are known to the model developer a priori. The approach relies on limited domain-specific knowledge but can replace or improve upon more elaborate domain specific models and on fully automated feature selection for many applications. As a wrapper-based approach, DGFS can augment existing multivariate techniques used in high-dimensional domains to produce improved modeling results particularly in situations where the volume of training data is limited. We demonstrate the viability of our method in several complex domains (natural and synthetic) that have significant temporal aspects and many observable features.

Keywords

Airline Ticket Pricing

Applied Machine Learning

Description

University of Minnesota Ph.D. dissertation. August 2015. Major: Computer Science. Advisor: Maria Gini. 1 computer file (PDF); xi, 185 pages.

Collections

Dissertations

Suggested citation

Groves, William. (2015). Toward Automating and Systematizing the Use of Domain Knowledge in Feature Selection. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/175444.

Content distributed via the University Digital Conservancy may be subject to additional license and use restrictions applied by the depositor. By using these files, users agree to the Terms of Use. Materials in the UDC may contain content that is disturbing and/or harmful. For more information, please see our statement on harmful content in digital repositories.

University Digital Conservancy

Toward Automating and Systematizing the Use of Domain Knowledge in Feature Selection

View/Download File

Persistent link to this item

Statistics

Journal Title

Journal ISSN

Volume Title

Title

Alternative title

Authors

Published Date

Publisher

Type

Abstract

Keywords

Description

Related to

Replaces

License

Collections

Series/Report Number

Funding information

Isbn identifier

Doi identifier

Previously Published Citation

Other identifiers

Suggested citation

University Digital Conservancy

University of Minnesota Twin Cities

Toward Automating and Systematizing the Use of Domain Knowledge in Feature Selection

View/Download File

Persistent link to this item

Statistics

Journal Title

Journal ISSN

Volume Title

Title

Alternative title

Authors

Published Date

Publisher

Type

Abstract

Keywords

Description

Related to

Replaces

License

Collections

Series/Report Number

Funding information

Isbn identifier

Doi identifier

Previously Published Citation

Other identifiers

Suggested citation