AI Agent Evaluation, LLM Instruction Following, and Figma AI Integration - April 2026 | Research Digest

AI Agent Evaluation and LLM Insights

News

How Figmates Used Figma AI to Take Delight to the Next Level | Figma Blog

Figmates using Figma AI tools during April Fun Day 2026

Papers

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Claw-Eval introduces a comprehensive evaluation framework for autonomous agents that addresses three major shortcomings in existing benchmarks: trajectory-opaque grading, underspecified safety and robustness evaluation, and narrow modality coverage. By recording every agent action through execution traces, audit logs, and environment snapshots, it enables fine-grained scoring across 2,159 rubric items, providing a much richer picture of agent behavior than final output-only metrics.

The study finds that trajectory-opaque evaluation systematically under-detects problems, missing 44% of safety violations and 13% of robustness failures. This exposes a critical gap in how agent capabilities are currently assessed, especially in real-world deployments where intermediate steps can matter as much as final outcomes. The framework reports Average Score, Pass@k, and Pass^k across repeated trials, which helps distinguish genuine capability from lucky outcomes: Pass@k rewards succeeding in at least one of k trials, while Pass^k requires success in all k.
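The difference between the three trial-level metrics is easiest to see with numbers. The sketch below uses the commonly cited combinatorial estimators for pass@k and pass^k; the digest does not give Claw-Eval's exact formulas, so treat these definitions (and the function names) as assumptions rather than the paper's implementation.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated chance that at least one of k trials passes, given c passes in n trials."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the chance that all k sampled trials come from the n - c failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that all k trials pass, given c passes in n trials."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Made-up per-trial scores for one task; a trial "passes" at score >= 0.5 (assumed threshold).
trial_scores = [1.0, 0.2, 0.9, 0.0]
passes = sum(score >= 0.5 for score in trial_scores)

average_score = mean(trial_scores)                    # 0.525
at_least_once = pass_at_k(len(trial_scores), passes, k=2)   # 5/6
every_time = pass_hat_k(len(trial_scores), passes, k=2)     # 1/6
```

With two passes in four trials, the agent looks strong under pass@2 (about 0.83) but weak under pass^2 (about 0.17), which is the gap between "can do it" and "does it reliably".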

Beyond benchmarking, Claw-Eval provides actionable insights for agent development, emphasizing that reliable deployment requires not just capability but also robustness and safety. The multimodal findings, in which most models perform worse on video than on document or image inputs, suggest that current models may be stronger on some modalities than others, pointing to a need for more balanced training and evaluation strategies.

Key Insight: Trajectory-aware evaluation is essential for reliable agent assessment, revealing that traditional benchmarks miss critical safety and robustness issues.

ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

arXiv:2604.06111

ACE-Bench addresses two major limitations in current agent benchmarks: high environment interaction overhead and imbalanced task distributions. By designing a unified grid-based planning task, it allows for fine-grained control over task horizon (via hidden slots H) and difficulty (via decoy budget B), enabling scalable and interpretable evaluation across diverse models and domains.

The lightweight environment design, using static JSON files for all tool calls, eliminates setup overhead and supports fast, reproducible evaluation. This is crucial for training-time validation and iterative model development, where rapid feedback is essential. The benchmark's ability to maintain domain consistency and model discriminability across 13 models of varying sizes and families underscores its utility for rigorous agent comparison.
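To illustrate why static tool responses make evaluation cheap and reproducible, here is a minimal sketch of a lookup-based tool environment. Everything in it is hypothetical: the tool name `get_cell`, the JSON layout, and the item names are invented for illustration and are not ACE-Bench's actual schema.

```python
import json

# Hypothetical static tool data. In a real setup this would be read from JSON
# files on disk; it is inlined here so the example is self-contained.
STATIC_TOOL_DATA = json.loads("""
{
  "get_cell": {
    "0,0": {"item": "apple"},
    "0,1": {"item": "bread"},
    "1,0": {"item": null}
  }
}
""")


def call_tool(name: str, args: dict) -> dict:
    """Answer a tool call by pure lookup: no network, no live service, no state."""
    table = STATIC_TOOL_DATA.get(name)
    if table is None:
        return {"error": f"unknown tool: {name}"}
    key = f"{args['row']},{args['col']}"
    return table.get(key, {"error": f"no entry for cell {key}"})


first = call_tool("get_cell", {"row": 0, "col": 1})
second = call_tool("get_cell", {"row": 0, "col": 1})
```

Because every response is a dictionary lookup, repeated calls return identical results and a full evaluation run needs no environment setup beyond loading the files.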

The orthogonal control of difficulty and horizon allows researchers to isolate and study specific aspects of agent reasoning, such as how models handle increasing complexity or longer planning horizons. This makes ACE-Bench not just a benchmark but a tool for understanding the limits and capabilities of agents in controlled, scalable settings.

Key Insight: ACE-Bench offers scalable, controllable, and lightweight agent evaluation through a grid-based planning task with orthogonal axes of difficulty and horizon.

How LLMs Follow Instructions: Skillful Coordination, Not a Universal Mechanism

arXiv:2604.06015

The paper challenges the widely held assumption that instruction-following in LLMs is driven by a universal mechanism. Through diagnostic probing across nine tasks, it shows that general probes underperform task-specific specialists, indicating limited representational sharing and suggesting that instruction-following is more about skillful coordination than a single abstract process.

Cross-task transfer is weak and clustered by skill similarity, further supporting the idea that LLMs deploy different capabilities for different tasks rather than applying a uniform instruction-following mechanism. Causal ablation reveals sparse asymmetric dependencies, and temporal analysis shows that constraint satisfaction operates dynamically during generation, not in pre-planning stages.

This nuanced understanding of instruction-following has implications for model design and evaluation. It suggests that future models should be better equipped to coordinate diverse skills rather than relying on a single, generalized constraint-checking module. The findings also imply that instruction tuning may be more about skill acquisition and coordination than about learning a universal instruction-following process.

Key Insight: Instruction-following in LLMs is better explained as skillful coordination of diverse linguistic capabilities rather than a universal constraint-checking mechanism.

Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

arXiv:2604.06013

Epistemic blinding addresses a critical issue in LLM-assisted analysis: the silent blending of data-driven inference with memorized priors, which is invisible in single outputs. By anonymizing entity identifiers before prompting and comparing outputs against unblinded controls, the protocol restores auditability by measuring how much of an output stems from the data versus the model's parametric knowledge.
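The blinding step and the blinded-versus-unblinded comparison can be sketched in a few lines. This is an illustrative sketch, not the released tool: the function names, the `ENTITY_001` code format, and the top-k overlap metric are assumptions, the scores are made up, and the model call itself is left out.

```python
import re


def blind(text: str, entities: list[str], prefix: str = "ENTITY") -> tuple[str, dict[str, str]]:
    """Replace each entity name with an opaque code; return blinded text and a code-to-name map."""
    # Codes follow the order of `entities`; shuffle that list first if order could leak identity.
    codes = {name: f"{prefix}_{i:03d}" for i, name in enumerate(dict.fromkeys(entities), start=1)}
    if not codes:
        return text, {}
    # Longest names first so a short name never matches inside a longer one.
    alternatives = "|".join(re.escape(n) for n in sorted(codes, key=len, reverse=True))
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    blinded = pattern.sub(lambda m: codes[m.group(0)], text)
    return blinded, {code: name for name, code in codes.items()}


def unblind(ranking: list[str], code_to_name: dict[str, str]) -> list[str]:
    """Map codes in a model's ranking back to the real entity names."""
    return [code_to_name.get(item, item) for item in ranking]


def top_k_overlap(a: list[str], b: list[str], k: int) -> float:
    """Fraction of top-k items shared by two rankings, ignoring order."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(a[:k]) & set(b[:k])) / k


# Made-up input data; the scores are placeholders, not real measurements.
data = "TP53 score=0.91; KRAS score=0.40; EGFR score=0.75"
blinded_text, code_to_name = blind(data, ["TP53", "KRAS", "EGFR"])

# Placeholder rankings standing in for two model runs (blinded vs. unblinded control).
blinded_ranking = unblind(["ENTITY_001", "ENTITY_003", "ENTITY_002"], code_to_name)
control_ranking = ["EGFR", "KRAS", "TP53"]
agreement = top_k_overlap(blinded_ranking, control_ranking, k=2)
```

A low overlap between the two rankings is the signal of interest: it indicates that recognizable names, rather than the supplied data, were driving part of the unblinded output.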

The approach is demonstrated in both oncology drug target prioritization and S&P 500 equity screening, showing that blinding can significantly alter top predictions while preserving the recovery of validated targets. This highlights the extent to which prior contamination can distort results, even in domains where data is abundant and models are trained on large datasets.

The protocol is released as an open-source tool and a Claude Code skill, lowering the barrier to adoption. Blinding does not necessarily improve results; its value is that researchers can measure how much of an agent's output rests on the supplied data rather than on memorized priors, which is crucial for trust and reproducibility in scientific and business applications.

Key Insight: Epistemic blinding is a protocol that separates data-driven inference from memorized priors in LLM outputs, enhancing auditability in agentic systems.

Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains

arXiv:2604.05987

Flowr presents a novel agentic AI framework for automating end-to-end retail supply chain workflows in large supermarket chains. By decomposing manual processes into specialized AI agents, it enables automation of tasks like demand forecasting, procurement, and inventory replenishment that were previously dependent on continuous human coordination.

The framework employs a consortium of fine-tuned, domain-specialized LLMs coordinated by a central reasoning LLM, with the aim of ensuring task accuracy and adherence to responsible AI principles. A human-in-the-loop orchestration model, enabled by the Model Context Protocol (MCP), preserves accountability and organizational control, making it suitable for enterprise deployment.
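The coordination pattern described above, a central coordinator that routes work to specialized agents and pauses for human sign-off on consequential actions, can be sketched as follows. This is a toy illustration only: the agent functions, the `Proposal` type, and the approval callback are hypothetical, plain functions stand in for the LLMs, and no MCP calls are shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Proposal:
    agent: str
    action: str
    needs_approval: bool  # True for actions a human must sign off on


def forecasting_agent(item: str) -> Proposal:
    # Stand-in for a domain-specialized model; read-only, so no approval needed.
    return Proposal("forecasting", f"forecast demand for {item}", needs_approval=False)


def procurement_agent(item: str) -> Proposal:
    # Stand-in for a model whose output commits spend, so it is gated.
    return Proposal("procurement", f"raise purchase order for {item}", needs_approval=True)


AGENTS: dict[str, Callable[[str], Proposal]] = {
    "forecast": forecasting_agent,
    "procure": procurement_agent,
}


def orchestrate(steps: list[tuple[str, str]], approve: Callable[[Proposal], bool]) -> list[str]:
    """Route each step to its agent and gate flagged proposals on human approval."""
    log = []
    for kind, item in steps:
        proposal = AGENTS[kind](item)
        if proposal.needs_approval and not approve(proposal):
            log.append(f"REJECTED {proposal.action}")
            continue
        log.append(f"EXECUTED {proposal.action}")
    return log


# A reviewer who rejects everything: only the ungated forecasting step runs.
audit_log = orchestrate([("forecast", "milk"), ("procure", "milk")], approve=lambda p: False)
```

Keeping the approval decision in a callback separate from the agents is what preserves accountability in this sketch: the agents only propose, and the log records which proposals a human allowed.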

Evaluation in collaboration with a large-scale supermarket chain shows that Flowr significantly reduces manual coordination overhead, improves demand-supply alignment, and enables proactive exception handling at scale. The framework is also domain-independent, offering a generalizable blueprint for agentic AI-driven automation in other large-scale enterprise settings, highlighting the potential for AI to transform complex, decision-intensive workflows.

Key Insight: Flowr demonstrates how agentic AI can automate complex retail supply chain operations, reducing manual coordination and improving decision-making at scale.