The Fundamentals Of Designing Autonomy Evaluations For AI Safety

Originally published on Forbes

When people ask about my company, I usually give them a vague answer: ‘We evaluate language models.’ This is not a great description and rarely sparks interest - people assume we’re another player in the crowded field of AI evaluations. Perhaps I’m breaking the cardinal rule of Silicon Valley to ‘always be pitching’. LLM evals are dominated by two subcategories. The first evaluates raw model capabilities through question-answer tests - imagine giving the AI a standardized test and scoring its answers. Perhaps you have heard of benchmarks such as MMLU. The second evaluates AI-powered applications built by companies for specific industries like law or recruiting. Our work doesn’t fit into either category. Instead, we make autonomy evals, measuring an AI’s ability to operate independently in open-ended computer environments. Unlike traditional benchmarks, our approach tests how well AI systems can navigate, reason, and accomplish goals autonomously - without task-specific guidance or constraints. Autonomy capabilities are of great interest to AI safety because many threat models involve AIs acting autonomously. I find this work really rewarding: it is technically difficult and contributes to something meaningful. If this sounds interesting, here are some tips for how to build good evals.

Find your niche

The demand for these evals is bigger than the supply, but only really good evals are sought after. I therefore suggest that you pick a niche and make sure you are the best in the world at making evals for that niche. For example, METR picked the “AI R&D” niche and Apollo picked deception.

Automatic scoring

Evaluations like these can take a long time to run, either because the tasks require many steps or because the agent interacts with slow tools. Consequently, we run numerous tasks in parallel, making it impractical to monitor every step of each agent. The best solution is to have the agent produce a final deliverable that can be automatically scored. For instance, the task might be to build an ML classifier—a process that could take hours—but the final output is simply a list of the classifier’s predictions on a test set. This output can be easily compared against the correct answers (ground truth) for automatic scoring.
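To make this concrete, here is a minimal sketch of such an automatic scorer, assuming the agent’s deliverable is a CSV of predicted labels and the ground truth lives in a second CSV (the file names and format are illustrative, not from any particular eval suite):

```python
import csv

def load_labels(path: str) -> list[str]:
    """Read one label per row from a CSV file."""
    with open(path) as f:
        return [row[0] for row in csv.reader(f)]

def score_submission(pred_path: str, truth_path: str) -> float:
    """Compare the agent's predictions against ground truth and return accuracy."""
    preds = load_labels(pred_path)
    truth = load_labels(truth_path)
    if len(preds) != len(truth):
        raise ValueError("Submission has the wrong number of predictions")
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

if __name__ == "__main__":
    accuracy = score_submission("agent_predictions.csv", "ground_truth.csv")
    print(f"Agent accuracy: {accuracy:.2%}")
```

The agent can spend hours building the classifier, but the scorer only ever looks at the final predictions file, so no human needs to watch the run.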

LLM scoring

In cases where producing such deliverables isn’t possible—for example, when you’re interested in evaluating the style of the agent’s approach—you might need different assessment methods. While you can sometimes use fixed rules (deterministic heuristics) to evaluate, often the best approach is to have a language model (a judge-LLM) make the assessment. However, achieving consistent scoring across different runs can be challenging. Judge-LLMs are good at ranking examples, but ensuring that they are calibrated across samples isn’t straightforward. To address this, use a very specific scoring rubric and provide calibration examples where you specify the expected score for a given example.
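Below is a minimal sketch of what such a judge might look like. The rubric, the calibration examples, and the `call_llm` stub are all placeholders you would replace with your own rubric and model client:

```python
RUBRIC = """Score the agent transcript from 1 to 5:
1 = no meaningful progress
3 = reasonable approach, but sloppy execution or unverified results
5 = clean, well-reasoned solution that checks its own output
Respond with a single integer."""

# Calibration examples pin down what each score level should look like.
CALIBRATION_EXAMPLES = [
    ("Agent guessed hyperparameters at random and never measured accuracy.", 1),
    ("Agent trained a model once and reported test accuracy without tuning.", 3),
    ("Agent held out a validation split, iterated, and sanity-checked results.", 5),
]

def call_llm(prompt: str) -> str:
    """Placeholder - swap in your actual model client here."""
    raise NotImplementedError("wire this up to your LLM provider")

def build_judge_prompt(transcript: str) -> str:
    examples = "\n\n".join(
        f"Example transcript: {text}\nExpected score: {score}"
        for text, score in CALIBRATION_EXAMPLES
    )
    return f"{RUBRIC}\n\n{examples}\n\nTranscript to score:\n{transcript}\nScore:"

def judge(transcript: str) -> int:
    """Ask the judge-LLM for a rubric score and parse it as an integer."""
    return int(call_llm(build_judge_prompt(transcript)).strip())
```

Keeping the rubric and calibration examples in one place also makes it easier to re-score old runs consistently when you tweak the judge.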

Baselines

Often, your scoring metric doesn’t provide absolute performance indicators. You might run two different agents, and while the scoring may indicate which one performs better, you still might not know if either is actually good. In this case, grounded baselines are helpful. Have a human complete the task to see what score they achieve. You can also establish baselines using standardized agent frameworks. Involving a human not only provides a performance benchmark but also helps identify bugs in your implementation. Additionally, timing how long it takes a human offers insights into the task’s significance. If an agent can automate a task that requires many hours from a human, such automation could have a substantial societal impact.
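One simple way to use such a baseline, sketched here with made-up numbers, is to report the agent’s score and time as fractions of the human’s on the same task:

```python
# Illustrative numbers only - in practice these come from your baseline runs.
human = {"score": 0.92, "hours": 6.5}   # expert human on the same task
agent = {"score": 0.74, "hours": 0.8}

relative_score = agent["score"] / human["score"]
relative_time = agent["hours"] / human["hours"]

print(f"Agent reaches {relative_score:.0%} of human performance "
      f"in {relative_time:.0%} of the time")
```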

Contamination

The goal of these agentic evaluations is to measure how capable models are at acting autonomously. Ideally, the problem is unique so the model hasn’t simply memorized the solution from its training data. For example, asking the model to create an MNIST classifier isn’t a good eval because numerous tutorials exist online, and the model likely has them in its training data. Assessing how much contamination your task has is difficult, but you develop intuition over time. A good starting point is to extensively search the problem online. Another method is to see if the model can solve the task immediately without any iterations (zero-shot). If it can, the task is likely contaminated. Keep in mind that this is a moving target; a task might be uncontaminated when you create it, but as new content appears online, this could change.
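A rough heuristic along these lines, with an arbitrary threshold purely for illustration: flag a task if the agent scores almost as well with iteration disabled as it does in a full run.

```python
def looks_contaminated(zero_shot_score: float, full_run_score: float,
                       threshold: float = 0.9) -> bool:
    """Flag a task if the model does nearly as well without any iteration.

    Both scores come from your own harness: one run with iteration disabled
    (a single attempt, no feedback) and one normal multi-step run.
    """
    if full_run_score <= 0:
        return False
    return zero_shot_score >= threshold * full_run_score
```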

Eval difficulty

The best tasks of this kind are very challenging for the agent. Since our purpose is to measure the cutting edge, the tasks must allow for demonstrating truly impressive capabilities. Additionally, the task should ideally remain relevant for the next generation of models. While current models may not pose significant risks, future models—or those currently in development—might. To be genuinely valuable for AI safety, the eval score needs to be fine-grained at these high levels of capability.

Subtasks

While making our evaluations difficult helps in assessing whether models could be dangerous, we also want to use these evaluations to predict when that might occur. To do this, our evaluations need to provide meaningful feedback at current model capabilities. Ideally, the evaluation should offer a fine-grained, continuous score where current models score low, leaving ample room for improvement. One approach is to create subtasks by breaking the hard task into smaller steps. You can design it so that a subtask can start from the endpoint of another, generating a multitude of task combinations. Another method is to provide a list of hints and observe how many the agent requires to complete the task. The downside of these approaches is that they assume there’s a ‘correct solution.’ If we want to truly measure excellence, the task should be open-ended enough to allow for creative solutions we haven’t envisioned.
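Here is one way the hint-counting variant could be scored, as a sketch: the hints are invented, and `attempt` stands in for whatever harness runs your agent.

```python
from typing import Callable, Sequence

# Invented hints for a hypothetical ML task - yours will be task-specific.
HINTS = [
    "The training data is in /data/train.csv.",
    "A gradient-boosted tree baseline reaches roughly 80% accuracy.",
    "The 'timestamp' column leaks the label; drop it before training.",
]

def hint_score(attempt: Callable[[Sequence[str]], bool],
               hints: Sequence[str]) -> float:
    """Reveal hints one at a time; needing fewer hints earns a higher score.

    `attempt` is your harness: it runs the agent with the hints revealed so
    far and returns True if the agent completed the task.
    """
    for n in range(len(hints) + 1):
        if attempt(hints[:n]):
            return 1.0 - n / (len(hints) + 1)
    return 0.0
```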

Proxies

Often, the capability you’re interested in can’t be measured directly. In such cases, you need to create a task that serves as a proxy or a specific example of that capability. For instance, if you want to investigate whether models can manage resources, evaluating how they play Monopoly can measure this to some extent, but it’s only a stand-in for managing real-world resources. Similarly, if you want to assess whether models are good at ML research, you might design an evaluation where the model has to implement a single ML task. While this doesn’t fully answer whether models excel at ML research, a single example can be a useful proxy. It’s important to carefully consider whether your proxy effectively represents the capability you’re aiming to measure.

Follow me on X or subscribe via RSS to stay updated.