Agentic Eval Taxonomy

Agentic AI Evaluation Landscape

Comprehensive taxonomy of benchmarks across 17 capability dimensions (2024 – March 2026)

Total Benchmarks (across 17 categories)
Sources Analyzed (papers, blogs, threads)
New in 2026 (Jan – Mar 2026)
Capability Gaps (zero-coverage dimensions)
Researchers Tracked: 18 (Tier 1 & 2)
Citation Graph (papers traced)
Last Updated

Benchmarks by Category

Growth by Year

Benchmark Growth Over Time

Capability Coverage

Latest Additions

Benchmark Registry

All benchmarks, searchable and filterable

# | Benchmark | Publisher | Date | Venue | Tasks | Top Score | Category | Imported | Analysis

Capability Matrix

Benchmarks × 17 capability dimensions; each cell marks the capability as primary or secondary.

Gap Analysis

Over-evaluated, under-evaluated, and zero-coverage capability dimensions

Complete Gaps (Zero Coverage)

Under-Evaluated

Over-Evaluated
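The gap categories can be derived from the capability matrix by counting how many benchmarks mark each dimension. The sketch below makes several assumptions. Cells are stored as a dict of benchmark → {dimension: "primary" | "secondary"}. A primary mark counts 1.0 and a secondary mark counts 0.5. The under/over thresholds are hypothetical cut-offs. Only "zero coverage" (no benchmark marks the dimension at all) follows directly from the definition above. The dimension names in any usage are placeholders, not the dashboard's actual 17 dimensions.

```python
# Hypothetical cell values mirroring the matrix legend (primary / secondary).
PRIMARY, SECONDARY = "primary", "secondary"


def coverage_counts(matrix, dimensions, primary_weight=1.0, secondary_weight=0.5):
    """Weighted benchmark count per dimension; the weights are assumptions."""
    counts = {d: 0.0 for d in dimensions}
    for marks in matrix.values():
        for dim, mark in marks.items():
            if dim not in counts:
                continue  # ignore dimensions outside the tracked set
            if mark == PRIMARY:
                counts[dim] += primary_weight
            elif mark == SECONDARY:
                counts[dim] += secondary_weight
    return counts


def classify_gaps(counts, under=2.0, over=10.0):
    """Bucket dimensions; 'under' and 'over' are hypothetical thresholds."""
    result = {"zero": [], "under": [], "over": []}
    for dim, c in counts.items():
        if c == 0:
            result["zero"].append(dim)  # no benchmark covers this dimension
        elif c < under:
            result["under"].append(dim)
        elif c > over:
            result["over"].append(dim)
    return result
```

Weighting secondary marks below primary ones is one design choice among several. Counting only primary marks would make the "under-evaluated" bucket stricter.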

Benchmark Saturation Timeline

Benchmark | Introduced | Top Score | Status

Source Reader

Browse source summaries from arXiv, announcements, blog posts, and Twitter threads

All
arXiv
Announcements
Blogs
Twitter

Citation Rankings

Benchmarks ranked by composite relevance score (citations + recency + diversity + specificity)

Rank | Benchmark | Year | Citations | Score | Tier | Primary Capabilities
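The composite relevance score names four signals (citations, recency, diversity, specificity) but not how they are combined. Here is a minimal sketch, assuming each signal is scaled to [0, 1] and the four are weighted equally. The field names, the log scaling of citations, the linear recency decay, the use of "dimensions covered out of 17" as the diversity proxy, and the pre-normalized specificity value are all hypothetical choices, not the dashboard's documented method.

```python
import math
from dataclasses import dataclass


@dataclass
class BenchmarkSignals:
    # Hypothetical per-benchmark inputs; not the dashboard's actual schema.
    name: str
    citations: int
    year: int
    num_capabilities: int  # capability dimensions covered (assumed diversity proxy)
    specificity: float     # assumed to be pre-normalized to [0, 1]


# Assumed equal weights for (citations, recency, diversity, specificity).
DEFAULT_WEIGHTS = (0.25, 0.25, 0.25, 0.25)


def composite_score(b, max_citations, current_year=2026, total_dims=17,
                    weights=DEFAULT_WEIGHTS):
    """Weighted sum of four signals, each scaled to [0, 1] (hypothetical scheme)."""
    # Log-scale citations so a few very highly cited papers do not dominate.
    if max_citations > 0:
        cite = math.log1p(b.citations) / math.log1p(max_citations)
    else:
        cite = 0.0
    # Linear recency decay: current year -> 1.0, minus 0.25 per year older, clamped to [0, 1].
    recency = min(1.0, max(0.0, 1.0 - 0.25 * (current_year - b.year)))
    # Fraction of the 17 capability dimensions the benchmark touches.
    diversity = min(b.num_capabilities, total_dims) / total_dims
    parts = (cite, recency, diversity, b.specificity)
    return sum(w * p for w, p in zip(weights, parts))


def rank(benchmarks):
    """Return (score, name) pairs sorted from highest to lowest score."""
    max_c = max(b.citations for b in benchmarks)
    scored = [(composite_score(b, max_c), b.name) for b in benchmarks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Changing the `weights` tuple is a quick way to see how sensitive the ranking is to any one signal. For example, raising the recency weight pushes 2026 benchmarks up even when their citation counts are still low.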

Month-by-Month Progression

Chronological view of benchmark publications and key developments

Special Reports

In-depth analyses of specific topics in the agentic evaluation landscape

Feature Requests

Track ideas and improvements for the dashboard
