Large Language Models (LLMs) have been suggested for use in automated
vulnerability repair, but benchmarks showing they can consistently identify
security-related bugs are lacking. We thus develop SecLLMHolmes, a fully
automated evaluation framework that performs the most detailed investigation to
date on whether LLMs can reliably identify and reason about security-related
bugs. We construct a set of 228 code scenarios and analyze eight of the most
capable LLMs across eight different investigative dimensions using our
framework. Our evaluation shows that LLMs provide non-deterministic responses and
incorrect, unfaithful reasoning, and that they perform poorly in real-world scenarios.
Most importantly, our findings reveal significant non-robustness in even the
most advanced models, such as PaLM2 and
GPT-4: by merely changing function or
variable names, or by adding library functions to the source code,
these models can yield incorrect answers in 26% and 17% of cases, respectively.
These findings demonstrate that further advances are needed before LLMs can
be used as general-purpose security assistants.