LLMs in the SOC (Part 1) | Why Benchmarks Fail Security Operations Teams
LLM cybersecurity benchmarks fail to measure what defenders need: faster detection, reduced containment time, and better decisions under pressure.
Gabriel Bernadett-Shapiro & Edir Garcia Lazo
Executive Summary
- SentinelLABS’ analysis of LLM cybersecurity benchmarks, including those published by major players such as Microsoft and Meta, found that none measure what actually matters for defenders.
- Most LLM benchmarks test narrow tasks, but these map poorly to security workflows, which are typically continuous, collaborative, and frequently disrupted by unexpected changes.
- Models that excel at coding and math provide minimal direct gains on security tasks, indicating that general LLM capabilities do not readily translate to analyst-level thinking.
- All of today’s benchmarks use LLMs to evaluate other LLMs, often using the same vendor’s models for both, creating a closed loop that is more susceptible to gaming and harder to trust.
- As frontier labs push defenders to rely on models to automate security operations, the importance of benchmarks will increase drastically as the main mechanism to evaluate whether the capabilities of the models match the vendor’s claims.
For security teams, AI promised to write secure code, identify and patch vulnerabilities, and replace monotonous security operations tasks. Its key value proposition was raising costs for adversaries while lowering them for defenders.
To evaluate whether Large Language Models were both performant and reliable enough to be deployed into the enterprise, a wave of new benchmarks was created. In 2023, these early benchmarks largely comprised multiple-choice exams over clean text, which produced clean and reproducible performance metrics. However, as the models improved, they outgrew the early tests: scores across models began to converge at the top of the scale as the benchmarks became increasingly “saturated”, and the tests themselves ceased to tell us anything meaningful.
As the industry has boomed over the past few years, benchmarking has become a way to distinguish new models from older ones. Developing a benchmark that shows how a smaller model outperforms a larger one released by a frontier AI lab is a billion-dollar industry, and now every new model launches with a menagerie of charts with bold claims. +3.7 on SomeBench-v2, SOTA on ObscureQA-XL, or 99th percentile on an-exam-no-one-had-heard-of-last-week. The subtext here is simple: look at the bold numbers, be impressed, and please join our seed round!
Inside this swamp of scores and claims, security teams are somehow meant to conclude that a system is safe enough to trust with an organization’s business, its users, and maybe even its critical infrastructure. However, a careful read through the arXiv benchmark firehose reveals a hard-to-miss pattern: we have more benchmarks than ever, and somehow we are still not measuring what actually matters for defenders.
So what do security benchmarks actually measure? And how well does this approach map to real security work?
In this post, we review four popular LLM benchmarking evaluations: Microsoft’s ExCyTIn-Bench, Meta’s CyberSOCEval and CyberSecEval 3, and Rochester Institute of Technology’s CTIBench. We explore what we think these benchmarks get right and where we believe they fall short.
What Current Benchmarks Actually Measure
ExCyTIn-Bench | Realistic Logs in a Microsoft Snow Globe
ExCyTIn-Bench was the cleanest example of an “agentic” Security Operations benchmark that we reviewed. It drops LLM agents into a MySQL instance that mirrors a realistic Microsoft Azure tenant, providing 57 Sentinel-style tables, 8 distinct multi-stage attacks, and a unified log stream spanning 44 days of activity.
Each question posed to the LLM agent is anchored to an incident graph path. This means that the agent must discover the schema, issue SQL queries, pivot across entities, and eventually answer the question. Rewards for the agent are path-aware, meaning that full credit is assigned for the right answer, but the agent could also earn partial credit for each correct intermediate step that it took.
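The path-aware scoring described above can be sketched in a few lines. This is our own illustration of the idea, not ExCyTIn-Bench’s actual implementation: the function name, weighting split, and step-matching logic are all assumptions.

```python
def path_aware_reward(agent_steps, gold_path, final_correct,
                      step_weight=0.5, answer_weight=0.5):
    """Illustrative partial-credit scorer (not ExCyTIn-Bench's code).

    agent_steps:   ordered entities/pivots the agent actually visited
    gold_path:     the incident-graph path underlying the question
    final_correct: whether the agent's final answer was right
    """
    if not gold_path:
        return answer_weight if final_correct else 0.0
    # Partial credit for each gold intermediate step the agent reached.
    hits = sum(1 for step in gold_path if step in set(agent_steps))
    step_score = step_weight * hits / len(gold_path)
    # Full answer credit only if the final answer is correct.
    answer_score = answer_weight if final_correct else 0.0
    return step_score + answer_score
```

Under this kind of scheme, an agent that pivots through two of three gold entities but answers incorrectly still earns partial credit, which is what lets the benchmark distinguish “lost immediately” from “almost got there”.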
The headline result is telling:
“*Our comprehensive experiments with different models confirm the difficulty of the task: with the base setting, the average reward across all evaluated models is 0.249, and the best achieved is 0.368…*” (arXiv)
Microsoft’s ExCyTIn benchmark demonstrates that LLMs struggle to plan multi-hop investigations over realistic, heterogeneous logs.
This is an important finding, especially for those concerned with how LLMs perform in real-world scenarios. That said, all of this takes place in a Microsoft snow globe: one fictional Azure tenant, eight well-studied, canned attacks, clean tables, and curated detection logic for the agent to work with. Although the realistic agent setup is a massive improvement over trivia-style Multiple Choice Question (MCQ) benchmarks, it is not the daily chaos of real security operations.
CyberSOCEval | Defender Tasks Turned into Exams
CyberSOCEval is part of Meta’s CyberSecEval 4 and deliberately picks two tasks defenders care about: malware analysis over real sandbox detonation logs and threat intelligence reasoning over 45 CTI reports. The authors open with a statement we very much agree with:
“*This lack of informed evaluation has significant implications for both AI developers and those seeking to apply LLMs to SOC automation. Without a clear understanding of how LLMs perform in real-world security scenarios, AI system developers lack a north star to guide their development efforts, and users are left without a reliable way to select the most effective models.*” (arXiv)
To evaluate these tasks, the benchmark frames them as multi-answer multiple-choice questions and incorporates analytically computed random baselines and confidence intervals. This setup gives clean, statistically grounded comparisons between models, but it also reduces complex workflows into simplified questions. The researchers found that the models perform far above random, yet remain far from solving the tasks.
In the malware analysis trial, models score exact-match accuracy in the teens to high-20s percentage range versus a random baseline of around 0.63%. For threat-intel reasoning, models land in the ~43 to 53% accuracy band versus ~1.7% random.
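Such tiny random baselines fall directly out of the combinatorics of multi-answer questions: a guesser must pick exactly the right subset of options. A minimal sketch, assuming a guesser that includes each option independently with probability 1/2 (our simplification for illustration, not necessarily the paper’s exact baseline model):

```python
from statistics import mean

def random_exact_match_baseline(option_counts):
    """Expected exact-match accuracy of a random guesser that includes
    each of a question's options independently with p = 0.5.
    With n options there are 2**n possible answer subsets, only one of
    which matches exactly, so the per-question probability is 1 / 2**n.
    (Our illustration; Meta's published computation may differ.)"""
    return mean(0.5 ** n for n in option_counts)
```

For example, a benchmark of questions with seven options each would have a random exact-match baseline of 1/128, under one percent, which is why even modest model accuracy sits far above random while remaining far below useful.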
In other words, the models are clearly extracting meaningful signal from real logs and CTI reports. However, they still fail to correctly answer most of the malware questions and roughly half of the threat intelligence questions.
These findings suggest that for any system aimed at automating SOC workflows, model performance should be evaluated as assistive rather than autonomous.
Crucially, they find that test-time “reasoning” models don’t get the same uplift they see in math/coding:
“*We also find that reasoning models leveraging test time scaling do not achieve the boost they do in areas like coding and math, suggesting that these models have not been trained to reason about cybersecurity analysis…*” (arXiv)
That’s a big deal, and it’s evidence that you don’t get generalized security reasoning for free just by cranking up “thinking steps”.
Meta’s CyberSOCEval falls short because it compresses two complex domains into MCQ exams. There is no notion of triaging multiple alerts, asking follow-up questions, or hunting down log sources. In real life, analysts need to decide when to stop, escalate, or switch paths.
[...]