Automated Code-Behavior and -Semantic Understanding for Security

Wu, Qiushi
2024-01-05
2023-09
https://hdl.handle.net/11299/259678
University of Minnesota Ph.D. dissertation. September 2023. Major: Computer Science. Advisor: Kangjie Lu. 1 computer file (PDF); xvi, 194 pages.

There has been a growing focus on strengthening program security to protect software ecosystems, especially given the rapid growth in the number of programs in the software supply chain. Static program analysis, embraced by both industry and academia, examines a program in depth without executing it, which makes it pivotal to enhancing software security. Static program-analysis techniques can operate at the source-code, binary, or intermediate representation (IR) level. They can dissect data dependencies, control flow, type information, memory operations, cache activities, function calls, and more, all of which reveal the low-level semantics of a program. With this information, one can pinpoint security vulnerabilities, examine patches, or simulate the execution behavior of a program. The capabilities of static program analysis are rooted in the foundational principles of programming-language and compiler theory. However, traditional static analysis also has shortcomings, particularly in grasping the high-level semantics of programs. For example, it struggles to extract complex programming logic rules, such as the privilege prerequisites for accessing specific variables or functions. Furthermore, when faced with a function such as fread(), static analysis cannot accurately interpret its high-level behavior, namely reading a file. Yet understanding such high-level code behaviors is essential for in-depth analysis of the security aspects of programs. For example, distinguishing between confidential and non-confidential data is crucial because each demands distinct privilege-protection mechanisms.
Recognizing such a difference requires a sophisticated grasp of the program's high-level semantics. Consequently, bridging the gap between high-level code behaviors and low-level code semantics is imperative for bolstering the security of real-world programs. Over the past few years, we have carried out the following work to bridge this gap. First, we used general behavioral rules of code, summarized with statistical methods, to minimize the reliance on high-level code semantics. Specifically, we introduced HERO, a system designed to detect Disordered Error Handling (DiEH) bugs. It operates on a fundamental programming principle: error cleanup functions should be invoked in a stack-like order. Leveraging this rule, HERO can pinpoint numerous error-handling-related bugs, such as use-after-free, without relying on the high-level semantics of programs. Our second work used security rules and formal definitions to analyze code behaviors. Specifically, we introduced SID to evaluate the security impacts of bugs based on their corresponding patches. The driving concept behind SID is that both the impact of a patch and violations of security rules, such as out-of-bounds access, can be framed as constraints solvable through automated methods. Consequently, SID can accurately distinguish security-related patches from unrelated ones. In this project, the high-level semantics of the code are extracted by human interpretation and later evaluated using formal methods. In addition, we leveraged machine learning (ML) models to decipher the behaviors of functions semi-automatically. Specifically, we developed DiffCVSS to discern the correlation between functions and CVSS metrics by analyzing both function descriptions and vulnerability narratives. We also employed GNNIC to probe the similarity among functions by examining their call graphs, function names, and the types they use, all with the assistance of graph neural networks.
In these two projects, the high-level semantics of the code are summarized and analyzed using natural language processing techniques combined with machine learning methodologies.

en
code behavior
code semantic
program analysis
security
vulnerability
Thesis or Dissertation