Bridging Visual Perception and Reasoning: A Visual Attention Perspective



Published Date

2023-06

Type

Thesis or Dissertation

Abstract

One of the fundamental goals of Artificial Intelligence (AI) is to develop visual systems that can reason with the complexity of the world. Advances in machine learning have revolutionized many fields in computer vision, achieving human-level performance on several benchmark tasks and industrial applications. While the performance gap between machines and humans seems to be closing, recent debates on the discrepancies between machine and human intelligence have also received considerable attention. Studies argue that existing vision models tend to use strategies different from human perception, and are vulnerable to even a tiny shift in visual domains. Evidence also suggests that they commonly exploit statistical priors instead of genuinely reasoning about the visual observations, and have yet to develop the capability to overcome issues resulting from spurious data biases. These contradictory observations strike at the very heart of AI research, and bring attention to the question: How can AI systems understand the comprehensive range of visual concepts and reason with them to accomplish various real-life tasks, as we do on a daily basis? Humans learn much from little. With just a few relevant experiences, we are able to adapt to different situations. We also take advantage of inductive biases that generalize easily, and avoid distraction from all kinds of statistical biases. This innate generalizability results not only from our profound understanding of the world but also from the ways we perceive and reason with visual information. For instance, unlike machines that develop holistic understanding by scanning through the whole visual scene, humans prioritize their attention with a sequence of eye fixations. Guided by visual stimuli and a structured reasoning process, we progressively locate the regions of interest, and understand their semantic relationships as well as their connections to the overall task.
Despite the lack of a comprehensive understanding of human vision, research on human visual behavior can provide abundant insights into the development of vision models, and has the potential to contribute to AI systems that are practical for real-world scenarios. With the overarching goal of building visual systems with human-like reasoning capability, we focus on understanding and enhancing the integration between visual perception and reasoning. We leverage visual attention as an interface for studying how humans and machines prioritize their focus when reasoning with diverse visual scenes. We tackle the challenges by making progress from three distinct perspectives: from the visual perception perspective, we study the relationship between the accuracy of attention and performance on visual understanding; from the reasoning perspective, we examine the connections between reasoning and visual perception, and study the roles of attention throughout the continuous decision-making process; and from the explainability perspective, motivated by the observation that humans not only capture and reason about important information with high accuracy but can also justify their rationales with supporting evidence, we explore the use of multi-modal explanations for justifying the rationales behind models' decisions. Our efforts provide an extensive collection of observations for demystifying the integration between perception and reasoning, and, more importantly, they offer insights into the development of trustworthy AI systems with the help of human vision.

Description

University of Minnesota Ph.D. dissertation. June 2023. Major: Computer Science. Advisor: Catherine (Qi) Zhao. 1 computer file (PDF); xii, 135 pages.

Suggested citation

Chen, Shi. (2023). Bridging Visual Perception and Reasoning: A Visual Attention Perspective. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/258712.

Content distributed via the University Digital Conservancy may be subject to additional license and use restrictions applied by the depositor. By using these files, users agree to the Terms of Use. Materials in the UDC may contain content that is disturbing and/or harmful. For more information, please see our statement on harmful content in digital repositories.