
LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, Gianluca Stringhini. 2024 IEEE Symposium on Security and Privacy (SP) – 50 citations

[Paper]
Security

Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date on whether LLMs can reliably identify and reason about security-related bugs. We construct a set of 228 code scenarios and analyze eight of the most capable LLMs across eight different investigative dimensions using our framework. Our evaluation shows that LLMs provide non-deterministic responses, produce incorrect and unfaithful reasoning, and perform poorly in real-world scenarios. Most importantly, our findings reveal significant non-robustness in even the most advanced models like PaLM2 and GPT-4: by merely changing function or variable names, or by adding library functions to the source code, these models can yield incorrect answers in 26% and 17% of cases, respectively. These findings demonstrate that further LLM advances are needed before LLMs can be used as general-purpose security assistants.
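The robustness finding rests on semantics-preserving code perturbations: identifiers in a vulnerable snippet are renamed and the model is asked the same detection question again. The sketch below illustrates that idea only; it is not SecLLMHolmes's implementation, the C snippet is a made-up example, and `ask_llm` stands in for whatever model API is under test.

```python
import re


def rename_identifiers(code: str, mapping: dict[str, str]) -> str:
    """Apply a semantics-preserving rename of function/variable names.

    A crude word-boundary substitution; real tooling would use a parser,
    but this is enough to illustrate the perturbation.
    """
    for old, new in mapping.items():
        code = re.sub(rf"\b{re.escape(old)}\b", new, code)
    return code


# Toy scenario: an unbounded strcpy into a fixed-size buffer.
original = """
void copy_input(char *src) {
    char buf[8];
    strcpy(buf, src);   /* no bounds check */
}
"""

# Trivial rename: the vulnerability is unchanged, only identifiers differ.
perturbed = rename_identifiers(
    original,
    {"copy_input": "process_record", "src": "payload", "buf": "scratch"},
)

prompt_template = (
    "Does the following C function contain a security vulnerability? "
    "Answer yes or no and explain.\n\n{code}"
)

for variant in (original, perturbed):
    prompt = prompt_template.format(code=variant)
    # ask_llm(prompt) would query the model under test (hypothetical helper);
    # a robust detector should return the same verdict for both variants.
    print(prompt)
    print("---")
```

Comparing the verdicts on the original and perturbed variants is what surfaces the 26% and 17% inconsistency rates reported for PaLM2 and GPT-4.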

Similar Work