
Claw-Eval introduces an evaluation framework for autonomous agents that addresses three shortcomings of existing benchmarks: trajectory-opaque grading, underspecified safety and robustness evaluation, and narrow modality coverage. It records every agent action through execution traces, audit logs, and environment snapshots, which enables fine-grained scoring across 2,159 rubric items and gives a much richer picture of agent behavior than metrics based on the final output alone.
The study finds that trajectory-opaque evaluation misses 44% of safety violations and 13% of robustness failures. This exposes a critical gap in how agent capabilities are currently assessed, especially in real-world deployments where intermediate steps can matter as much as final outcomes. Reporting Average Score, Pass@k, and Pass^k across repeated trials also helps separate genuine capability from lucky outcomes.
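The three trial-level metrics can be illustrated with a minimal sketch. It assumes the common reading of Pass@k (at least one of k trials passes) and Pass^k (all k trials pass); the function name `summarize_trials` and the pass threshold are hypothetical and are not taken from Claw-Eval itself.

```python
from statistics import mean


def summarize_trials(scores, pass_threshold=0.75):
    """Summarize k repeated trials of one task.

    `scores` holds one score in [0, 1] per trial. `pass_threshold` is an
    assumed cutoff for counting a trial as passed; the benchmark may
    define passing differently.
    """
    if not scores:
        raise ValueError("at least one trial score is required")
    passed = [s >= pass_threshold for s in scores]
    return {
        "average_score": mean(scores),  # mean score over all trials
        "pass_at_k": any(passed),       # at least one trial passed
        "pass_hat_k": all(passed),      # every trial passed
    }


# Example: three trials of the same task, one of which fails.
print(summarize_trials([0.9, 0.4, 0.8]))
```

A task that passes Pass@k but fails Pass^k succeeded at least once but not consistently, which is the "lucky outcome" case the metrics are meant to expose.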
Beyond benchmarking, Claw-Eval provides actionable insights for agent development, emphasizing that reliable deployment requires not just capability but also robustness and safety. The multimodal findings, where most models perform worse on video than on documents or images, suggest that current models may be overfitting to certain modalities and point to a need for more balanced training and evaluation strategies.
ACE-Bench addresses two major limitations in current agent benchmarks: high environment interaction overhead and imbalanced task distributions. By designing a unified grid-based planning task, it allows for fine-grained control over task horizon (via hidden slots H) and difficulty (via decoy budget B), enabling scalable and interpretable evaluation across diverse models and domains.
The lightweight environment design, using static JSON files for all tool calls, eliminates setup overhead and supports fast, reproducible evaluation. This is crucial for training-time validation and iterative model development, where rapid feedback is essential. The benchmark's ability to maintain domain consistency and model discriminability across 13 models of varying sizes and families underscores its utility for rigorous agent comparison.
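The idea of serving every tool call from static JSON can be sketched as follows. This is an illustration only: the record schema (`tool`, `args`, `result`), the `get_price` tool, and the helper names are hypothetical, not ACE-Bench's actual file format.

```python
import json

# Hypothetical static environment: each record is one pre-recorded tool response.
ENV_JSON = """
[
  {"tool": "get_price", "args": {"item": "apple"}, "result": {"price": 3}},
  {"tool": "get_price", "args": {"item": "pear"}, "result": {"price": 5}}
]
"""


def build_index(env_json):
    """Index recorded responses by (tool name, canonical argument string)."""
    index = {}
    for record in json.loads(env_json):
        # sort_keys makes the lookup key independent of argument order
        key = (record["tool"], json.dumps(record["args"], sort_keys=True))
        index[key] = record["result"]
    return index


def call_tool(index, tool, args):
    """Return the recorded response; no live service is ever contacted."""
    key = (tool, json.dumps(args, sort_keys=True))
    return index.get(key, {"error": "no recorded response"})


index = build_index(ENV_JSON)
print(call_tool(index, "get_price", {"item": "apple"}))
```

Because every response is a lookup, the same call always returns the same result, which is what makes runs fast and reproducible.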
The orthogonal control of difficulty and horizon allows researchers to isolate and study specific aspects of agent reasoning, such as how models handle increasing complexity or longer planning horizons. This makes ACE-Bench not just a benchmark but a tool for understanding the limits and capabilities of agents in controlled, scalable settings.
A separate probing study challenges the widely held assumption that instruction-following in LLMs is driven by a universal mechanism. Diagnostic probing across nine tasks shows that general probes underperform task-specific specialists, which indicates limited representational sharing and suggests that instruction-following is closer to skillful coordination of distinct capabilities than to a single abstract process.
Cross-task transfer is weak and clustered by skill similarity, further supporting the idea that LLMs deploy different capabilities for different tasks rather than applying a uniform instruction-following mechanism. Causal ablation reveals sparse asymmetric dependencies, and temporal analysis shows that constraint satisfaction operates dynamically during generation, not in pre-planning stages.
This nuanced understanding of instruction-following has implications for model design and evaluation. It suggests that future models should be better equipped to coordinate diverse skills rather than relying on a single, generalized constraint-checking module. The findings also imply that instruction tuning may be more about skill acquisition and coordination than about learning a universal instruction-following process.
Epistemic blinding addresses a critical issue in LLM-assisted analysis: the silent blending of data-driven inference with memorized priors, which is invisible in single outputs. By anonymizing entity identifiers before prompting and comparing outputs against unblinded controls, the protocol restores auditability by measuring how much of an output stems from the data versus the model's parametric knowledge.
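The blinding step and the blinded-versus-unblinded comparison might look roughly like the sketch below. All names here (`blind`, `unblind`, `top_k_overlap`, the `ENTITY_nnn` code format, and the example gene symbols) are illustrative assumptions, not the released tool's interface.

```python
import re


def blind(text, entities, prefix="ENTITY"):
    """Replace each entity name with an opaque code; return text and mapping.

    Plain substring matching is used, so a name embedded in a longer
    token would also be replaced; a real tool would need stricter matching.
    """
    names = sorted(set(entities))
    if not names:
        return text, {}
    mapping = {name: f"{prefix}_{i:03d}" for i, name in enumerate(names, start=1)}
    # Match longer names first so a short name cannot clip a longer one.
    longest_first = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in longest_first))
    return pattern.sub(lambda m: mapping[m.group(0)], text), mapping


def unblind(text, mapping):
    """Map codes in a model output back to the original entity names."""
    if not mapping:
        return text
    reverse = {code: name for name, code in mapping.items()}
    pattern = re.compile("|".join(re.escape(c) for c in reverse))
    return pattern.sub(lambda m: reverse[m.group(0)], text)


def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k items shared by two rankings."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


prompt = "Rank EGFR and KRAS by evidence strength."
blinded_prompt, mapping = blind(prompt, ["EGFR", "KRAS"])
# blinded_prompt == "Rank ENTITY_001 and ENTITY_002 by evidence strength."
```

A low `top_k_overlap` between the unblinded ranking and the blinded ranking (after `unblind`) would indicate that the model's prior knowledge of the named entities, rather than the supplied data, is driving the unblinded result.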
The approach is demonstrated in both oncology drug target prioritization and S&P 500 equity screening, showing that blinding can significantly alter top predictions while preserving the recovery of validated targets. This highlights the extent to which prior contamination can distort results, even in domains where data is abundant and models are trained on large datasets.
The protocol is released as an open-source tool and a Claude Code skill, lowering the barrier to adoption. Blinding does not necessarily improve results, but it lets researchers measure how closely their agents adhere to the intended analytical process, which is crucial for trust and reproducibility in scientific and business applications.
Flowr presents a novel agentic AI framework for automating end-to-end retail supply chain workflows in large supermarket chains. By decomposing manual processes into specialized AI agents, it enables automation of tasks like demand forecasting, procurement, and inventory replenishment that were previously dependent on continuous human coordination.
The framework employs a consortium of fine-tuned, domain-specialized LLMs coordinated by a central reasoning LLM, ensuring task accuracy and adherence to responsible AI principles. A human-in-the-loop orchestration model, enabled by the Model Context Protocol (MCP), preserves accountability and organizational control, making it suitable for enterprise deployment.
Evaluation in collaboration with a large-scale supermarket chain shows that Flowr significantly reduces manual coordination overhead, improves demand-supply alignment, and enables proactive exception handling at scale. The framework is also domain-independent, offering a generalizable blueprint for agentic AI-driven automation in other large-scale enterprise settings and illustrating how AI could transform complex, decision-intensive workflows.