
The Operational Evaluation Framework for Cyber Security Risks in AI (OCCULT) is a methodology developed by MITRE to assess the risks posed by large language models (LLMs) in offensive cyber operations (OCO). As AI technology advances, concern is growing about its potential misuse in executing sophisticated cyberattacks. The OCCULT Framework aims to provide a standardized approach for evaluating how capably AI systems can autonomously execute or assist in cyberattacks. What follows is an analysis of the framework, its components, and its implications:
Key Components of the OCCULT Framework
1. OCO Capability Areas
- Alignment with MITRE ATT&CK® Tactics: The framework evaluates LLMs against various tactics outlined in the MITRE ATT&CK framework. These tactics include lateral movement, privilege escalation, and credential access, ensuring that the evaluation covers a comprehensive range of offensive cyber capabilities.
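To make the idea of tactic alignment concrete, here is a minimal sketch of how a benchmark harness might tally which ATT&CK tactics its test items cover. The tactic names follow MITRE ATT&CK, but the test items and the `tactic_coverage` helper are illustrative assumptions, not part of OCCULT itself.

```python
# Illustrative coverage check: count how many test items exercise each
# ATT&CK tactic. The items below are fabricated for demonstration.
from collections import Counter

TEST_ITEMS = [
    {"id": "q1", "tactic": "Lateral Movement"},
    {"id": "q2", "tactic": "Privilege Escalation"},
    {"id": "q3", "tactic": "Credential Access"},
    {"id": "q4", "tactic": "Lateral Movement"},
]

def tactic_coverage(items):
    """Return a tactic -> item-count mapping for a set of test items."""
    return Counter(item["tactic"] for item in items)
```

A harness could use such a tally to flag tactics with thin or missing coverage before drawing conclusions about a model's overall OCO capability.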
2. LLM Use Cases
- Knowledge Assistants: This use case evaluates LLMs as tools that enhance human capabilities by providing relevant information and recommendations. The focus is on how well the models can assist cybersecurity professionals in making informed decisions during cyber operations.
- Co-Orchestration Agents: In this scenario, LLMs collaborate with platforms like Caldera™ to execute coordinated cyber operations. The evaluation assesses the models’ ability to work alongside other tools and systems to achieve strategic objectives.
- Autonomous Operators: This use case tests the ability of LLMs to independently plan and execute cyberattacks without human intervention. It examines the models’ decision-making processes, adaptability, and effectiveness in achieving their goals.
3. Reasoning Power
- Planning and Environmental Perception: This component measures the model’s ability to understand and navigate complex environments. It evaluates how well the models can develop strategic plans based on the information available.
- Action Iteration and Task Generalization: This aspect assesses the model’s capacity to adapt to evolving network defenses and generalize tasks across different scenarios. It focuses on the models’ ability to iteratively refine their actions and strategies to achieve their objectives.
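The plan-act-observe-iterate cycle described above can be sketched as a minimal evaluation loop. Everything here is an assumption for illustration: `query_llm` is a placeholder with a hard-coded policy standing in for a real model call, and the state fields are invented.

```python
# Minimal sketch of an iterative plan-act-observe loop for evaluating an
# LLM agent. `query_llm` is a hypothetical stand-in for a model API call.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    observations: list = field(default_factory=list)
    actions_taken: list = field(default_factory=list)

def query_llm(prompt: str) -> str:
    # Placeholder policy: propose one scan when nothing has been observed
    # yet, then declare the episode finished.
    return "scan_network" if "[]" in prompt else "done"

def run_episode(state: AgentState, max_steps: int = 5) -> AgentState:
    """Ask the model to plan, execute the action, record the observation,
    and iterate until the model stops or the step budget is exhausted."""
    for _ in range(max_steps):
        action = query_llm(
            f"Given {state.observations}, plan the next step toward: {state.goal}"
        )
        if action == "done":
            break
        state.actions_taken.append(action)
        state.observations.append(f"result of {action}")
    return state
```

A real harness would score the resulting action trace against the scenario's objectives, which is where the planning and generalization measurements come in.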
Core Test Cases
1. Threat Actor Competency Test for LLMs (TACTL)
- Scenario-Based Multiple-Choice Benchmark: This test evaluates the models’ knowledge of 44 ATT&CK techniques through a series of 30 questions. Each question involves dynamic variables, preventing the models from simply memorizing answers.
- Preliminary Results: The DeepSeek-R1 model performed exceptionally, achieving 100% accuracy on this question set and outperforming other models such as Mixtral 8x22B. This indicates an advanced understanding of offensive cyber techniques.
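The dynamic-variable idea above can be sketched with a simple question template: each render randomizes the scenario details and shuffles the answer order, so a memorized answer string is useless. The template, variable names, and answer choices below are illustrative, not taken from the OCCULT paper; the ATT&CK IDs are real (T1003.001 is OS Credential Dumping: LSASS Memory).

```python
# Hedged sketch of a TACTL-style multiple-choice item with dynamic
# variables. The scenario text is fabricated for demonstration.
import random

TEMPLATE = (
    "An operator on host {host} wants to dump credentials from LSASS. "
    "Which ATT&CK technique ID applies?"
)
CHOICES = ["T1003.001", "T1021.002", "T1068", "T1110"]
ANSWER = "T1003.001"  # OS Credential Dumping: LSASS Memory

def render_question(seed: int) -> dict:
    """Render one item: randomize the host name and shuffle the choices."""
    rng = random.Random(seed)
    host = f"WS-{rng.randint(100, 999)}"
    choices = CHOICES[:]
    rng.shuffle(choices)
    return {
        "question": TEMPLATE.format(host=host),
        "choices": choices,
        "answer_index": choices.index(ANSWER),
    }
```

Because the correct option lands at a different index on each seed, a model must actually identify the technique rather than recall a fixed letter.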
2. Synthetic Active Directory Environments
- Graph-Based Analysis: In this test, LLMs are challenged to match the analysis performed by tools like BloodHound in identifying attack paths within Active Directory environments. Models like Llama 3.1-405B were able to identify 52.5% of high-value targets but faced challenges with complex queries, highlighting areas for improvement in real-world scenarios.
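At its core, the BloodHound-style analysis described above is graph reachability: given directed edges encoding control relationships (for example, AdminTo or MemberOf), can a compromised principal reach a high-value target? The sketch below uses a fabricated edge set and plain breadth-first search; it is a toy illustration of the query type, not BloodHound's actual engine.

```python
# Illustrative attack-path search over a toy Active Directory graph.
# Edges read "A controls / can move to B"; the data is fabricated.
from collections import deque
from typing import Optional

EDGES = {
    "user:alice": ["group:helpdesk"],
    "group:helpdesk": ["host:WS01"],
    "host:WS01": ["user:svc_backup"],
    "user:svc_backup": ["host:DC01"],  # high-value: domain controller
}

def attack_path(start: str, target: str) -> Optional[list]:
    """Breadth-first search; returns the shortest path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in EDGES.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

The evaluation essentially asks whether an LLM, given the same relationship data, can reproduce paths that this kind of explicit graph traversal finds mechanically.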
3. High-Fidelity Network Emulations
- Autonomous Decision-Making: This test evaluates the models’ ability to autonomously navigate high-fidelity network environments while avoiding detection. In the Worm Scenario, Llama 3.1-70B demonstrated efficient lateral movement and objective completion but exhibited noisy behavior, triggering more alerts than human operators. This underscores the need for further refinement in the models’ stealth capabilities.
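The trade-off noted above, completing objectives versus staying quiet, suggests a scoring function that credits progress but penalizes detection events. The weights below are arbitrary illustrative choices, not OCCULT's metric; the point is that a noisy agent can complete the mission yet still score below a stealthier one.

```python
# Toy scoring sketch for a worm-scenario run: reward objectives and
# footholds, penalize triggered alerts. Weights are illustrative only.

def score_run(hosts_compromised: int, objectives_done: int, alerts: int) -> float:
    return 10.0 * objectives_done + 2.0 * hosts_compromised - 3.0 * alerts

# A fast-but-loud run versus a slower, quieter one:
noisy = score_run(hosts_compromised=6, objectives_done=2, alerts=9)
stealthy = score_run(hosts_compromised=4, objectives_done=2, alerts=1)
```

Under these example weights the stealthy run scores higher despite touching fewer hosts, which mirrors the finding that alert volume matters alongside raw objective completion.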
Key Findings and Implications
DeepSeek-R1 Proficiency
- Outstanding Performance: The DeepSeek-R1 model showed strong proficiency in offensive cyber knowledge, solving 91.8% of the TACTL-183 challenges. This result marks DeepSeek-R1 as one of the most capable models evaluated under the framework, with implications for both cyber defense and offense.
Performance Improvements
- Meta’s Llama and Mistral’s Mixtral Models: These models demonstrated significant improvements over earlier versions, particularly in simulations of both offensive and defensive cyber operations. Their enhanced capabilities underscore the rapid advancements in AI technology and its potential applications in cybersecurity.
Broader Implications
Security Concerns
- Potential Misuse of AI: The advanced capabilities of LLMs in executing or assisting in cyberattacks raise concerns about their potential misuse by malicious actors. The OCCULT Framework highlights the need for robust security measures and ethical considerations in the development and deployment of AI technologies.
Standardization of Evaluation
- Benchmark for Assessing AI Risks: The OCCULT Framework provides a standardized and rigorous methodology for assessing the risks associated with AI systems in cyber operations. This benchmark helps cybersecurity experts better understand and mitigate the potential threats posed by these technologies.
Final Thoughts
The OCCULT Framework represents a significant advancement in the evaluation of cybersecurity risks associated with AI and large language models. By providing a comprehensive and standardized approach, the framework aids in understanding the capabilities and limitations of AI systems in offensive cyber operations. This knowledge is crucial for developing effective mitigation strategies and ensuring the ethical use of AI in cybersecurity.

