AutoLibra

Agent Metric Induction from Open-Ended Human Feedback

Hao Zhu1, Phil Cuvin2, Xinkai Yu3, Charlotte Ka Yee Yan1, Jason Zhang1, Diyi Yang1
1Stanford University  ·  2University of Toronto  ·  3University of Pennsylvania

TL;DR. Task success is a blunt instrument for evaluating AI agents. AutoLibra turns informal natural-language feedback — from end-users or expert annotators — into concrete, fine-grained metrics that diagnose why agents succeed or fail, and serve as optimization targets that raise task success rates by more than 20% (absolute) with only 18 annotated trajectories per stage.

At a glance

From a sentence of feedback to a fleet of metrics

A user watches a web agent shop for a phone and writes: "the agent did not choose iPhone 14/15." AutoLibra grounds that aspect to the agent's action (selecting iPhone 16 Pro from a drop-down), clusters it with similar behaviors across trajectories, and distills a reusable metric — Element Interaction Accuracy.

AutoLibra pipeline: human feedback is grounded to agent behaviors, clustered into aspects, distilled into metrics, and used by an LLM-as-a-Judge to evaluate trajectories. The coverage and redundancy of the induced metrics on held-out feedback then meta-evaluate the metrics themselves.
Figure 1. AutoLibra induces agent-evaluation metrics from human feedback, uses them to evaluate agents, and meta-evaluates the metrics via their coverage of unseen feedback. Example shown on WebVoyager [He et al., 2024].
Motivation

Task success isn't enough

Agents today are primarily evaluated by coarse task-success metrics that experts hand-design up front. Those metrics miss why an agent fails, overlook emergent behaviors, and don't scale to new domains. On the other hand, humans readily describe what went well or poorly — "If you find that the button is disabled, don't click it again," or "this agent has too much autonomy."

AutoLibra closes this gap: it treats free-form feedback as the signal, and the metrics themselves as the output. Inspired by thematic analysis in social sciences, the pipeline grounds each aspect of feedback to concrete agent behaviors, then clusters them into a minimal set of reusable, interpretable metrics — no per-task metric design required.

Method

A closed loop: induce, evaluate, meta-evaluate

AutoLibra is a closed-loop pipeline. An induction process converts agent trajectories and open-ended feedback into metrics. An evaluation process applies those metrics with an LLM-as-a-Judge, then meta-evaluates the metrics by measuring their coverage and redundancy against unseen feedback.

Step 1. Feedback grounding

Break down each piece of feedback into aspects — triples of (behavior, feedback, sign) — each pointing to a specific part of the agent trajectory.

Feedback grounding diagram
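The output of the grounding step can be sketched as a small data structure. This is a minimal illustration, not the paper's code; the names `Aspect` and the example field values are assumptions based on the description above.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Aspect:
    """One grounded unit of feedback: a (behavior, feedback, sign) triple."""
    behavior: str            # a concrete span of the trajectory, e.g. an action the agent took
    feedback: str            # the clause of human feedback that this behavior grounds
    sign: Literal["+", "-"]  # whether the feedback praises or criticizes the behavior

# Example from Figure 1: the user wrote "the agent did not choose iPhone 14/15",
# which grounds to the agent's drop-down selection action.
aspect = Aspect(
    behavior='select("iPhone 16 Pro") from the model drop-down',
    feedback="the agent did not choose iPhone 14/15",
    sign="-",
)
print(aspect)
```

Each aspect points back into the trajectory, which is what lets later steps cluster behaviors rather than raw sentences.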
Step 2. Behavior clustering

Cluster similar aspects with an LLM; each cluster becomes a metric with a definition plus positive and negative behavior examples.

Behavior clustering diagram
Step 3. LLM-as-a-Judge

An LLM rates each agent trajectory on every induced metric with {+1, −1, N/A}, producing positive and negative traits.

LLM-as-a-Judge diagram
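The judging protocol amounts to asking one constrained question per metric per trajectory. A minimal sketch, with a stub standing in for the real judge model; `judge_trajectory`, the prompt wording, and `fake_llm` are all hypothetical, not the paper's implementation.

```python
from typing import Callable, Optional

# Map the judge's allowed answers to trait values; None encodes N/A.
RATINGS = {"+1": 1, "-1": -1, "N/A": None}

def judge_trajectory(trajectory: str, metrics: dict[str, str],
                     llm: Callable[[str], str]) -> dict[str, Optional[int]]:
    """Rate one trajectory on every induced metric with {+1, -1, N/A}."""
    traits = {}
    for name, definition in metrics.items():
        prompt = (
            f"Metric: {name}\nDefinition: {definition}\n"
            f"Trajectory:\n{trajectory}\n"
            "Answer with exactly one of: +1, -1, N/A."
        )
        traits[name] = RATINGS.get(llm(prompt).strip())
    return traits

# Stub "LLM" so the sketch runs without an API call.
fake_llm = lambda prompt: "-1" if "Element Interaction Accuracy" in prompt else "N/A"
traits = judge_trajectory(
    "step 1: select('iPhone 16 Pro') from the model drop-down ...",
    {"Element Interaction Accuracy":
     "Did the agent interact with the intended page elements?"},
    fake_llm,
)
print(traits)  # {'Element Interaction Accuracy': -1}
```

Positive traits are the metrics rated +1, negative traits those rated −1; N/A metrics produce no trait for that trajectory.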
Step 4. Meta-evaluation

Match traits to aspects on unseen feedback. Coverage = fraction of aspects matched; redundancy = fraction of traits unmatched.

Meta-evaluation diagram
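Given a trait-to-aspect matching on held-out feedback, the two meta-metrics reduce to set arithmetic. A minimal sketch assuming the matching has already been computed as (trait, aspect) pairs; the function name and toy data are illustrative, not from the paper.

```python
def coverage_and_redundancy(aspects, traits, matches):
    """Coverage = fraction of held-out aspects matched by some trait;
    redundancy = fraction of traits that match no aspect."""
    matched_aspects = {a for _, a in matches}
    matched_traits = {t for t, _ in matches}
    coverage = len(matched_aspects) / len(aspects) if aspects else 0.0
    redundancy = 1 - len(matched_traits) / len(traits) if traits else 0.0
    return coverage, redundancy

# Toy example: four held-out aspects, three traits, one trait matching nothing.
aspects = ["a1", "a2", "a3", "a4"]
traits = ["t1", "t2", "t3"]
matches = [("t1", "a1"), ("t1", "a2"), ("t2", "a3")]  # t3 is redundant
cov, red = coverage_and_redundancy(aspects, traits, matches)
print(cov, red)  # coverage 0.75, redundancy ~0.33
```

Metric optimization (Figure 2) then searches over candidate metric sets for high coverage at low redundancy.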
Optimization loop: induction process produces metrics; evaluation process scores coverage and redundancy; these two objectives are optimized iteratively.
Figure 2. Metric optimization: search for the metric set that maximizes coverage while minimizing redundancy.
Lens

AutoLibra as a lens for agent behavior

Across four agentic datasets spanning three diverse domains — collaborative (CoGym), social (Sotopia), and web (WebArena, WebVoyager) — AutoLibra induces metrics that are more concrete than expert-designed categories and surfaces novel evaluation dimensions.

Coverage vs. redundancy Pareto curves on four datasets (CoGym, Sotopia, WebArena, WebVoyager). Circles: candidate metric sets. Stars: best metrics on held-out feedback. Squares: ablation removing behavior examples, which collapses coverage.
Figure 3. Coverage and redundancy of AutoLibra metrics on four agentic datasets. Stars mark the best metrics on held-out feedback; squares show the ablation where positive/negative behavior examples are removed — coverage drops by up to 30%.
88%

Coverage on WebArena and WebVoyager, with held-out performance within 5% of the induction set.

>85%

Human-validated agreement on grounding, LLM-as-a-Judge, and meta-evaluation steps across five datasets.

+novel

Metrics like Negotiation Tactics (Sotopia) and Query Search Strategy (WebVoyager) are missed by expert frameworks.

On CoGym, AutoLibra not only recovers the five failure categories proposed by the authors — it decomposes Communication into Responsiveness & Efficiency and Communication Clarity, revealing that a single expert label was concealing two distinct behaviors.

Ladder

AutoLibra as a ladder for agent improvement

AutoLibra-induced metrics aren't just diagnostic — they're optimization targets. On the challenging 2D game Baba-Is-AI, 3 stages of iterative feedback (only 18 trajectory annotations per stage) drive >20% absolute improvement on task success — without ever optimizing for success rate directly.

Running maximum of metric scores and task success rate across three stages of iterative AutoLibra optimization on Baba-Is-AI. Success rate rises steadily until Stage 3, when the agent begins to overthink.
Figure 4. Iterative metric induction drives continuous improvement on Baba-Is-AI. Success rate keeps rising even though only metric scores are optimized.
Example Baba-Is-AI task that requires self-referential rule manipulation.
Figure 5. A Baba-Is-AI task: change the rule "baba is you" to "door is you", form "ball is win," and navigate to the red ball.

What the agent learned, stage by stage

  1. Reading the map and finding rules to form, guided by map-n-constraint-recognition.
  2. Assembling new win conditions, guided by rule-manipulation-proficiency.
  3. Handling self-referential rule changes — a metacognitive skill that frontier LLMs struggle with.
Pipeline

The full AutoLibra pipeline

End-to-end AutoLibra pipeline diagram, from human feedback collection through grounding, clustering, LLM-as-a-Judge, and meta-evaluation.
Citation

Cite AutoLibra

If you use AutoLibra in your research, we'd appreciate a citation.

@inproceedings{zhu2026autolibra,
  title     = {AutoLibra: Agent Metric Induction from Open-Ended Human Feedback},
  author    = {Hao Zhu and Phil Cuvin and Xinkai Yu and
               Charlotte Ka Yee Yan and Jason Zhang and Diyi Yang},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=4BjGVZ7Bxn}
}

Acknowledgments

This work is supported by ONR grant N000142412532, NSF grant IIS-2247357, and DARPA grant Friction for Accountability in Conversational Transactions. We thank Google Cloud Platform and Modal for compute credits, and all members of Stanford SALT Lab for their feedback throughout this project.