Structured Online Learning with Full and Bandit Information

Published Date

2016-09

Type

Thesis or Dissertation

Abstract

Numerous problems require algorithms to repeatedly interact with an environment that may include humans, physical obstacles such as doors or walls, biological processes, or even other algorithms. One popular application domain where such repeated interaction is necessary is social media. Every time a user opens a social media application, an algorithm must make a series of decisions about what is shown, ranging from news content to friend recommendations to trending topics, after which the user provides feedback, frequently in the form of a click or no click. The algorithm must use this feedback to learn the user's likes and dislikes. Similar scenarios play out in medical treatments, autonomous robot exploration, online advertising, and algorithmic trading. In such applications, users often have high expectations of the algorithm, such as immediately showing relevant recommendations, quickly administering effective treatments, or performing profitable trades in fractions of a second. Such demands require algorithms that can learn in a dynamic environment, learn efficiently, and provide high-quality solutions. Designing algorithms which meet these demands poses significant challenges for researchers. In this thesis, we design and analyze machine learning algorithms which interact with the environment, and we specifically study two aspects which can help address these challenges: (1) learning online, where the algorithm selects an action and then receives the outcome (i.e., the loss) of that action, and (2) exploiting the structure (sparsity, group sparsity, low-rankness) of a solution or user model. We explore these aspects under two feedback models: full and bandit information. With full information feedback, the algorithm observes the loss of every action it could have selected; for example, a trading algorithm can observe the price of each stock it could have invested in at the end of the day. With bandit information feedback, the algorithm observes only the loss of the action taken; for example, a medical treatment algorithm can only observe whether a patient's health improved under the treatment provided. We measure the performance of our algorithms by their regret, which is the difference between the cumulative loss received by the algorithm and the cumulative loss of the best fixed or time-varying sequence of actions in hindsight.
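
To make the two feedback models concrete, the following is a minimal sketch of the interaction protocol. The loss matrix, horizon, action count, and uniform-random policy are hypothetical placeholders for illustration, not details taken from the thesis.

    import numpy as np

    rng = np.random.default_rng(0)
    n_actions, horizon = 5, 100
    # Hypothetical losses: losses[t, a] is the loss of action a in round t.
    losses = rng.random((horizon, n_actions))

    for t in range(horizon):
        # Placeholder policy; a real online learner adapts using the feedback.
        action = int(rng.integers(n_actions))
        # Full information: the learner observes the loss of every action.
        full_feedback = losses[t]
        # Bandit information: the learner observes only the chosen action's loss.
        bandit_feedback = losses[t, action]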
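
For a fixed comparator, the regret just described can be written as

    R_T = \sum_{t=1}^{T} \ell_t(a_t) - \min_{a \in \mathcal{A}} \sum_{t=1}^{T} \ell_t(a),

where a_t is the action selected in round t, \ell_t is the round-t loss function, and \mathcal{A} is the set of available actions (generic notation, not taken from the thesis); for a time-varying comparator, the minimum is taken over sequences of actions instead.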

In the first half of this thesis, we focus on full information settings and study online learning algorithms for general resource allocation problems motivated, in part, by applications in algorithmic trading. The first two topics we explore are controlling the cost of updating an allocation, for example, one's stock portfolio, and learning to allocate a resource across groups of objects, such as stock market sectors. In portfolio selection, making frequent trades may incur substantial transaction costs and hurt one's bottom line. Moreover, groups of stocks may perform similarly, and investing in a few groups may lead to higher returns. We design and analyze two efficient algorithms and present new theoretical regret bounds. Further, we show experimentally that the algorithms earn more wealth than existing algorithms, even in the presence of transaction costs. The third and fourth topics we consider are two different ways to control suitable measures of risk associated with a resource allocation. The first approach is through diversification, and the second is through hedging, where, in the application of portfolio selection, a trader borrows shares from the bank and holds both long and short positions. We design and analyze two efficient online learning algorithms which either diversify across groups of stocks or hedge between individual stocks. We establish standard regret bounds and show experimentally that our algorithms earn more wealth, in some cases orders of magnitude more, than existing algorithms while incurring less risk.

In the second half of this thesis, we focus on bandit information settings and show how the structure of a user model can be used to design algorithms with theoretically sharper regret bounds. We study the stochastic linear bandit problem, which generalizes the widely studied multi-armed bandit. In the multi-armed bandit, an algorithm repeatedly selects an arm (action) from a finite decision set, after which it receives a stochastic loss. In the stochastic linear bandit, arms are instead selected from a decision set with infinitely many arms (e.g., vectors from a compact set), and the loss is a stochastic linear function parameterized by an unknown vector. The first topic we explore is how the regret scales when the unknown parameter is structured (e.g., sparse, group sparse, low-rank). We design and analyze an algorithm which uses this structure to construct tight confidence sets that contain the unknown parameter with high probability, leading to sharp regret bounds. The second topic we explore is how to generalize the previous algorithm to the non-linear losses often used in generalized linear models. We design and analyze a similar algorithm and show that the regret is of the same order as with linear losses.
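
To sketch the model just described in standard (generic) notation, in round t the algorithm selects an arm x_t from a decision set \mathcal{D} and incurs the loss

    \ell_t = \langle x_t, \theta^* \rangle + \eta_t,

where \theta^* is the unknown (e.g., sparse, group-sparse, or low-rank) parameter and \eta_t is zero-mean noise. A confidence-set-based algorithm maintains a set \mathcal{C}_t that contains \theta^* with high probability and selects arms optimistically, for example

    x_t = \arg\min_{x \in \mathcal{D}} \, \min_{\theta \in \mathcal{C}_t} \langle x, \theta \rangle,

so that tighter confidence sets translate into sharper regret bounds. This selection rule is the generic optimism-in-the-face-of-uncertainty principle, shown for illustration rather than as the thesis's exact algorithm.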

Description

University of Minnesota Ph.D. dissertation. September 2016. Major: Computer Science. Advisor: Arindam Banerjee. 1 computer file (PDF); xi, 172 pages.

Suggested citation

Johnson, Nicholas. (2016). Structured Online Learning with Full and Bandit Information. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/183364.
