Model Family Rankings
The agents we track here are model families: Hermes, Llama, Mistral, Qwen, DeepSeek, and other base/foundation model lineages. They are scored with a different methodology from developer agents because they leave different evidence trails: HuggingFace downloads instead of npm, fine-tune derivatives instead of dependency graphs, LMArena rank instead of Hacker News discussion.
Rankings populate as evidence accumulates, not on a fixed date. Currently 5 model families are tracked and 4 are evidence-ranked. The strict evidence rule requires multi-signal corroboration (see methodology below). As adapters ship for LMArena, HF derivatives, and citations over the next 2-3 weeks, well-known model families will become evidence-ready automatically.
Methodology
The model family composite score is a weighted blend of five signal sources. Every sub-score is published and every weight is documented.
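As a rough illustration of what a weighted blend of sub-scores can look like, here is a minimal sketch. The signal names, the weight values, the 0-100 scale, and the choice to renormalize over whichever signals are present are all assumptions made for this example; they are not the published weights or the actual scoring code.

```python
from typing import Mapping, Optional

# HYPOTHETICAL weights for illustration only; the real weights are
# documented elsewhere and may differ in both names and values.
ILLUSTRATIVE_WEIGHTS = {
    "downloads": 0.20,
    "lmarena": 0.35,
    "derivatives": 0.25,
    "citations": 0.15,
    "discourse": 0.05,
}


def composite_score(
    sub_scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float] = ILLUSTRATIVE_WEIGHTS,
) -> Optional[float]:
    """Weighted mean of the sub-scores that are present (not None).

    Assumption: missing signals are skipped and the remaining weights
    are renormalized, so a family is not penalized for an adapter that
    has not shipped yet.
    """
    present = {
        name: score
        for name, score in sub_scores.items()
        if score is not None and name in weights
    }
    if not present:
        return None  # nothing to blend
    total_weight = sum(weights[name] for name in present)
    weighted_sum = sum(weights[name] * score for name, score in present.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    example = {"downloads": 80.0, "lmarena": 60.0, "citations": None}
    print(round(composite_score(example), 2))
```

Renormalizing keeps scores comparable while coverage is partial, at the cost of letting a family with few signals lean heavily on the ones it has. That trade-off is part of why a separate evidence threshold is useful.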
Evidence-ready rule
A model family is evidence-ranked when at least 3 of 5 signals are present AND at least one of those is a capability-or-adoption signal (LMArena, citations, or derivatives, not just downloads). Download counts alone are a vanity metric for models, the same way GitHub stars are for developer agents.
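The rule above can be expressed as a small predicate. This is a sketch under assumptions: the signal identifiers (`"lmarena"`, `"citations"`, `"derivatives"`, and so on) and the function name are hypothetical labels chosen for the example, not names taken from the actual system.

```python
from typing import Iterable

# Signals the rule counts as capability-or-adoption evidence.
# The string identifiers are hypothetical labels for this sketch.
CAPABILITY_OR_ADOPTION = frozenset({"lmarena", "citations", "derivatives"})
MIN_SIGNALS = 3  # "at least 3 of 5 signals"


def is_evidence_ranked(present_signals: Iterable[str]) -> bool:
    """Return True when both conditions of the evidence-ready rule hold."""
    present = set(present_signals)
    has_enough_signals = len(present) >= MIN_SIGNALS
    has_capability_or_adoption = bool(present & CAPABILITY_OR_ADOPTION)
    return has_enough_signals and has_capability_or_adoption


if __name__ == "__main__":
    print(is_evidence_ranked({"downloads", "lmarena", "derivatives"}))
    print(is_evidence_ranked({"downloads", "lmarena"}))
```

Keeping the two conditions as separate named booleans makes it easy to report why a family is not yet evidence-ranked (too few signals versus no capability-or-adoption signal).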
Tracked model families (5)
Current coverage. Sub-scores are visible per agent — methodology shows its work.
Adapter roadmap
What we're building next. Signal coverage expands category-by-category over the next 2-3 weeks.
- [NEXT] LMArena adapter: Bradley-Terry capability rankings from chat.lmarena.ai. Strongest defensible capability signal in the agent ecosystem. Highest priority because no other source measures actual model quality.
- [+1] HF derivatives adapter: fine-tune and downstream-model counts per base model. Adoption signal: how much the ecosystem builds on top of this model.
- [+2] Citations adapter: Semantic Scholar / OpenReview where the model has a paper. Academic-credibility signal. Partial coverage acceptable, since not every model has a paper.
- [+3] Discourse / social adapter: X + Farcaster + Reddit + HN mentions. Lowest weight, last priority. Not required for evidence-ready threshold.
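For readers unfamiliar with the Bradley-Terry rankings mentioned for the LMArena adapter: the model gives each contender a positive strength and sets the probability that A beats B to strength_A / (strength_A + strength_B). The sketch below fits strengths from pairwise win counts with the standard iterative (minorization-maximization) update. It is a textbook illustration only; the function names and the input format are assumptions, and it does not reproduce LMArena's own pipeline or the adapter's code.

```python
from typing import Dict, Tuple


def bt_win_probability(strength_a: float, strength_b: float) -> float:
    """Bradley-Terry probability that A beats B, given positive strengths."""
    if strength_a <= 0 or strength_b <= 0:
        raise ValueError("strengths must be positive")
    return strength_a / (strength_a + strength_b)


def fit_bradley_terry(
    wins: Dict[Tuple[str, str], int], n_iter: int = 200
) -> Dict[str, float]:
    """Estimate strengths from wins[(a, b)] = number of times a beat b.

    Uses the classic iterative update
        s_i <- W_i / sum_j n_ij / (s_i + s_j)
    where W_i is i's total wins and n_ij the number of i-vs-j games.
    Caveat: a contender with zero wins is driven to strength 0.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            total_wins = sum(w for (winner, _), w in wins.items() if winner == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        # Rescale so the strengths average to 1; only ratios matter.
        norm = sum(updated.values())
        strength = {m: v * len(models) / norm for m, v in updated.items()}
    return strength


if __name__ == "__main__":
    # Hypothetical head-to-head counts between two placeholder models.
    fitted = fit_bradley_terry({("model_a", "model_b"): 3, ("model_b", "model_a"): 1})
    print(bt_win_probability(fitted["model_a"], fitted["model_b"]))
```

Because only strength ratios matter, the fitted values can be rescaled freely; leaderboards typically map them onto a more readable rating scale before display.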