Backdoors in contemporary artificial intelligence: the good and the bad
Authors
Xian, Xun
Published Date
May 2025
Abstract
Recent decades have witnessed tremendous developments in artificial intelligence (AI). From shallow feedforward neural networks to the most recent billion-parameter transformers, these innovations have significantly changed almost every aspect of our lives. However, extensive evidence has shown that these AI techniques are vulnerable to adversarial attacks, raising significant safety concerns about their use. Among the various adversarial attacks against AI, backdoor attacks have gained particular attention due to their stealthy nature. A backdoor attack refers to a generic class of attacks (from the attacker's perspective) that manipulate an AI system, e.g., a deep neural network, into making targeted predictions on inputs of the attacker's choosing. These attacks are especially difficult to detect and defend against because the malicious behavior activates only in the presence of a specific backdoor trigger, while the AI system behaves as usual on normal inputs. This doctoral thesis delves into backdoor attacks and their peripheral fields in contemporary AI, exploring three key aspects: their fundamental mechanisms, defensive strategies against them, and their beneficial applications.

• Understanding the fundamental mechanisms. In the first part of this thesis, our goal is to understand the fundamental mechanism behind backdoor attacks. To achieve this goal, we provide a theoretical understanding of when and how backdoor attacks against deep neural networks can be effective. In particular, we show that backdoor attacks are more effective when backdoor triggers follow certain patterns. These insights help explain why some empirical methods are more effective than others and, in turn, guide the design of more effective attacks. (A minimal poisoning sketch follows this abstract.)

• Developing detection-based defenses. In the second part of this thesis, we focus on defending against backdoor attacks. Our goal is to develop a detection-based framework that (i) offers provable guarantees on detection performance and (ii) applies to both convolutional neural networks for image inputs and transformer-based architectures for text inputs. To this end, we first provide a mathematical understanding of the backdoor detection problem and then build our framework on these insights, drawing on techniques from conformal prediction and deep representation learning. (A conformal-detection sketch follows this abstract.)

• Exploring beneficial applications. In the third part, we dive into the beneficial aspects of backdoor attacks. In particular, backdoors in the computer vision domain are often implemented as pixel patches. This form lends itself well to serving as a watermark for copyright protection, an increasingly important application in the era of generative AI. Building on this idea, we propose a robust and agile plug-and-play watermark detection framework, referred to as RAW. By leveraging a novel backdoor-motivated watermark scheme learning approach, RAW achieves up to 100× faster watermark injection and significantly enhanced robustness against adversarial attacks. Combined with techniques from the randomized smoothing literature, RAW provides provable guarantees on the false positive rate for misclassifying a watermarked image, even under adversarial attempts at watermark removal. (A randomized-smoothing verification sketch follows this abstract.)
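To make the trigger mechanism concrete, here is a minimal Python sketch of pixel-patch data poisoning. It illustrates the generic attack class described in the abstract, not the specific constructions analyzed in the thesis; the patch location, size, intensity, poisoning rate, and helper names are all illustrative assumptions.

```python
import numpy as np

def stamp_trigger(image: np.ndarray, patch_size: int = 4, value: float = 1.0) -> np.ndarray:
    """Stamp a small square patch (the backdoor trigger) in the bottom-right corner.

    Assumes `image` is an (H, W, C) float array in [0, 1]; the corner location
    and intensity are arbitrary illustrative choices.
    """
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:, :] = value
    return poisoned

def poison_dataset(images: np.ndarray, labels: np.ndarray, target_label: int,
                   poison_rate: float = 0.05, seed: int = 0):
    """Stamp the trigger on a small random fraction of images and relabel them."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_rate * len(images))
    for i in rng.choice(len(images), size=n_poison, replace=False):
        images[i] = stamp_trigger(images[i])
        labels[i] = target_label  # targeted prediction of the attacker's choosing
    return images, labels
```

A model trained on such a poisoned set tends to predict `target_label` whenever the patch is present while behaving normally on clean inputs, which is exactly the stealthy behavior the abstract describes.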
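The thesis's detection framework is not reproduced here, but the flavor of its provable guarantee can be conveyed with a split conformal sketch. Assume `calibration_scores` are nonconformity scores (for example, distances in a learned representation space) computed on clean, trusted inputs; the resulting threshold then caps the false-positive rate on exchangeable clean inputs at `alpha`. The function names and the choice of score are assumptions for illustration.

```python
import numpy as np

def conformal_threshold(calibration_scores: np.ndarray, alpha: float = 0.05) -> float:
    """Finite-sample threshold from n clean calibration scores.

    Flagging a new input whenever its score exceeds the
    ceil((n + 1) * (1 - alpha))-th smallest calibration score guarantees that
    a clean (exchangeable) input is falsely flagged with probability <= alpha.
    """
    n = len(calibration_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:  # too few calibration points to certify this alpha
        return float("inf")
    return float(np.sort(calibration_scores)[k - 1])

def flag_suspicious(test_scores: np.ndarray, threshold: float) -> np.ndarray:
    """Mark inputs whose nonconformity score exceeds the conformal threshold."""
    return test_scores > threshold
```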
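Finally, a hedged sketch of how a randomized-smoothing-style vote can control a watermark detector's error rate: query a hypothetical binary detector on Gaussian-perturbed copies and accept only when a one-sided binomial test rejects the null that the detector fires at its clean-image rate `p0`. This is a generic illustration of the randomized smoothing idea; RAW's actual scheme and guarantee are developed in the thesis.

```python
import numpy as np
from scipy.stats import binomtest

def smoothed_votes(image: np.ndarray, detector, sigma: float = 0.1,
                   n_samples: int = 200, seed: int = 0) -> int:
    """Count 'watermarked' votes from a binary detector on noisy copies.

    `detector` (hypothetical) maps an image to 0/1; sigma and n_samples are
    illustrative smoothing parameters.
    """
    rng = np.random.default_rng(seed)
    votes = 0
    for _ in range(n_samples):
        noisy = image + sigma * rng.standard_normal(image.shape)
        votes += int(detector(noisy))
    return votes

def verify_watermark(votes: int, n_samples: int, p0: float = 0.5,
                     alpha: float = 0.05) -> bool:
    """Declare 'watermarked' only when votes are significantly above p0.

    If the detector fires on at most a p0 fraction of noisy copies of
    unwatermarked images, this one-sided test bounds the chance of wrongly
    declaring such an image watermarked by alpha.
    """
    return binomtest(votes, n_samples, p=p0, alternative="greater").pvalue <= alpha
```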
Description
University of Minnesota Ph.D. dissertation. May 2025. Major: Electrical/Computer Engineering. Advisors: Jie Ding, Mingyi Hong. 1 computer file (PDF); xiv, 122 pages.
Suggested Citation
Xian, Xun. (2025). Backdoors in contemporary artificial intelligence: the good and the bad. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/275935.
