Methodology v1 Launch
AgentCrush Labs · Findings · 2026-05-16
What the agent economy looks like across 5+ signals
On May 16, 2026, AgentCrush shipped four category-specific scoring methodologies. Different agent categories leave different evidence trails, so we measure them differently. This page collects the most striking findings from that launch — the cases where multi-signal scoring produced a different answer than any single source would have given.
By the numbers
- Total tracked: 1,338 agents indexed
- Evidence-ranked: 25 across 4 categories
- Categories live: 4 (model_family · tokenized · service · developer)
- MCP tools: 7 (machine-readable methodology)
Finding #1 — headline
Single-source rankings invert under multi-signal scoring
The HuggingFace leader isn't the LMArena leader. The LMArena leader isn't the citation leader. The citation leader isn't the deployment leader. Each signal answers a different question — and when we combine them with documented weights, the resulting ranking is different from any of them taken alone.
- HuggingFace #1: Qwen (score 100)
- LMArena #1: Gemini (BT 1484)
- Derivatives #1: Qwen (1,046)
- Citations #1: Llama (51,449)
- Deployments #1: Gemini (144)
Composite #1 (the agent with the highest combined score across all 5 weighted signals): Qwen at score 85.
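A minimal sketch of how such a weighted composite could be computed. The five signal names mirror the signals above; the specific weights, the 0-100 normalization, and the renormalization over missing signals are illustrative assumptions, not the published AgentCrush values.

```typescript
// Signal names mirror the five signals listed above.
type SignalName = "huggingface" | "lmarena" | "derivatives" | "citations" | "deployment";

// Hypothetical weights; the real per-category weights are documented separately.
const WEIGHTS: Record<SignalName, number> = {
  huggingface: 0.2,
  lmarena: 0.25,
  derivatives: 0.2,
  citations: 0.2,
  deployment: 0.15,
};

// Assumes each raw signal has already been normalized to a 0-100 scale.
function composite(signals: Partial<Record<SignalName, number>>): number {
  let weightedSum = 0;
  let weightUsed = 0;
  for (const name of Object.keys(WEIGHTS) as SignalName[]) {
    const value = signals[name];
    if (value === undefined) continue; // a missing signal contributes nothing
    weightedSum += WEIGHTS[name] * value;
    weightUsed += WEIGHTS[name];
  }
  // Renormalize over the signals actually present (one possible design choice).
  return weightUsed > 0 ? weightedSum / weightUsed : 0;
}

// Example shape only; the numbers are placeholders, not any agent's real signal values.
composite({ huggingface: 100, derivatives: 95, citations: 70, deployment: 60 });
```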
Finding #2 — the admission test
Hermes admitted at #5, ranks last — by design
NousResearch Hermes is a beloved community model. It would top a vibe-based ranking. Our methodology admitted it — and ranked it last among model families.
The rule is: 3 of 5 signals must be present, AND at least one must be a capability signal (derivatives, LMArena, citations, or cross-protocol deployment). For weeks Hermes had only 2 signals (HuggingFace + a thin derivatives footprint) — not evidence-ready. When we added paper citations and deployment scanning to the methodology, Hermes earned its third signal:
- HuggingFace: 69
- LMArena: —
- Derivatives: 33
- Citations: 24
- Deployment: 27
Composite: 34. Hermes earned admission to the ranking via citations + deployment, but its raw footprint (HF downloads, no LMArena coverage, modest derivatives) keeps it at the back. This is the methodology working as designed: admit on evidence, rank on weight. No manual override.
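A small sketch of the admission rule as stated above: at least 3 of the 5 signals present, and at least one of them a capability signal. The signal names mirror the table; the data shape is an assumption for illustration.

```typescript
// Same signal names as in the composite sketch.
type SignalName = "huggingface" | "lmarena" | "derivatives" | "citations" | "deployment";

// Capability signals per the rule quoted above.
const CAPABILITY_SIGNALS: SignalName[] = ["derivatives", "lmarena", "citations", "deployment"];

function isEvidenceReady(signals: Partial<Record<SignalName, number>>): boolean {
  const present = (Object.keys(signals) as SignalName[]).filter(
    (name) => signals[name] !== undefined,
  );
  const hasCapability = present.some((name) => CAPABILITY_SIGNALS.includes(name));
  // Admission: at least 3 of 5 signals, and at least one capability signal.
  return present.length >= 3 && hasCapability;
}

// Hermes before citation and deployment scanning: only 2 signals, not evidence-ready.
isEvidenceReady({ huggingface: 69, derivatives: 33 }); // false
// Hermes after: 4 signals including capability signals, so it is admitted.
isEvidenceReady({ huggingface: 69, derivatives: 33, citations: 24, deployment: 27 }); // true
```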
Finding #3 — the honeypot test
Market cap alone is not a ranking
$TIBBIR has the largest USD market cap in the tokenized index ($108.2M). It does not rank #1.
AIXBT does, at composite 83. Why: the methodology weights on-chain liquidity (anti-honeypot), capital locked in token contracts (TVL = real commitment), and holder distribution. A high market cap with thin liquidity gets penalized, not rewarded. AgentCrush surfaced one Virtuals token at $380M market cap with $5K liquidity — exactly the pattern we built the methodology to demote.
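A rough sketch of the anti-honeypot idea: discount the market-cap signal when on-chain liquidity is tiny relative to the quoted cap. The 2% saturation threshold, the field names, and the function shape are assumptions for illustration, not the published tokenized-category formula.

```typescript
// Token-level signals used in the illustration; field names are assumptions.
interface TokenSignals {
  marketCapUsd: number;   // quoted USD market cap
  liquidityUsd: number;   // on-chain DEX liquidity
  tvlUsd: number;         // capital locked in token contracts (feeds other terms, not shown)
  topHolderShare: number; // 0..1 holder concentration (feeds other terms, not shown)
}

// Discount factor applied to the market-cap signal: thin liquidity relative
// to market cap drives it toward zero. The 2% saturation point is assumed.
function liquidityFactor(t: TokenSignals): number {
  const ratio = t.liquidityUsd / Math.max(t.marketCapUsd, 1);
  return Math.min(1, ratio / 0.02);
}

// The $380M market cap / $5K liquidity token mentioned above:
liquidityFactor({ marketCapUsd: 380_000_000, liquidityUsd: 5_000, tvlUsd: 0, topHolderShare: 0.9 });
// ~0.00066, so the headline market cap contributes almost nothing to the score
```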
Finding #4 — forks beat stars
Active engagement beats passive interest
For service agents (callable endpoints — A2A protocol, Agentverse, x402, ERC-8004), we use forks as a stronger adoption signal than stars. Anyone can star a repo. Forking means you're going to use or modify it.
A2A leads at composite 77 on 23,798 stars and 2,400 forks — a fork-to-star ratio of roughly 10% that signals real-world deployment, not just bookmarking.
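A minimal sketch of the forks-over-stars weighting. The 10x fork multiplier and the log scaling are illustrative assumptions; only the star and fork counts come from this page.

```typescript
// Adoption signal in which a fork counts for more than a star.
function adoptionSignal(stars: number, forks: number): number {
  const forkWeight = 10; // assumed: one fork carries roughly ten stars of intent
  return Math.log10(1 + stars + forkWeight * forks);
}

// A2A's numbers from the text: 23,798 stars and 2,400 forks.
adoptionSignal(23_798, 2_400); // ~4.68 before any normalization to a 0-100 scale
```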
Why this matters
- For users: if you're choosing an agent to deploy, "most-downloaded" tells you what's popular, not what works. Multi-signal scoring is closer to what works.
- For LLMs: when Claude or any model retrieves AgentCrush data via our MCP server, the methodology travels with the answer. The model can call get_methodology(category) and explain ranking decisions accurately to the end user (see the sketch after this list).
- For protocols: ERC-8004, x402, A2A, Virtuals, Agentverse — each leaves a different trail. A single universal ranking averages those trails into a meaningless number. Per-category methodology preserves the information.
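A hedged sketch of what that call looks like from a client, using the MCP TypeScript SDK. The server launch command, the "agentcrush-mcp" package name, and the exact argument shape are assumptions; only the get_methodology tool name and its category parameter appear on this page.

```typescript
// Assumes @modelcontextprotocol/sdk is installed; the AgentCrush server
// command below is a placeholder, not a documented launch command.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function fetchMethodology(category: string) {
  const client = new Client({ name: "agentcrush-example", version: "0.1.0" });
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["agentcrush-mcp"], // hypothetical package name
  });
  await client.connect(transport);

  // Tool name and category parameter come from this page;
  // the argument key being literally "category" is an assumption.
  const result = await client.callTool({
    name: "get_methodology",
    arguments: { category },
  });

  await client.close();
  return result;
}

fetchMethodology("model_family").then((r) => console.log(r));
```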
Read the full methodology
Every weight, every formula, every limitation is published.