Methodology & About

How We Evaluate Models

At CompareAI, we believe that choosing the right LLM shouldn’t require a PhD in Machine Learning. Our platform aggregates standardized evaluations across multiple axes, including Intelligence, Speed, and Price.

We do not conduct our own proprietary evaluations. We aggregate data from trusted, open sources such as the LMSYS Chatbot Arena, MMLU, HumanEval, and live API benchmarks provided by ArtificialAnalysis and others.

The Quality Score

Our "Quality Score" is a blended, normalized metric on a scale from 0 to 100. It is heavily weighted towards real-world human preference (Elo ratings) and general knowledge reasoning tasks.

  • 50% - LMSYS Chatbot Arena Elo (normalized)
  • 30% - MMLU / GPQA Diamond
  • 20% - HumanEval (Coding) & MATH
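As a sketch, the blend above can be computed like this. The Elo normalization bounds and the sample scores are illustrative assumptions, not our actual calibration:

```python
def normalize(value, lo, hi):
    """Scale a raw score to 0-100 within the range [lo, hi], clamped."""
    return max(0.0, min(100.0, (value - lo) / (hi - lo) * 100))

def quality_score(arena_elo, mmlu_gpqa, coding_math, elo_lo=1000, elo_hi=1400):
    """Blend: 50% normalized Arena Elo, 30% MMLU/GPQA, 20% HumanEval & MATH.
    elo_lo/elo_hi are hypothetical normalization bounds for illustration."""
    elo_norm = normalize(arena_elo, elo_lo, elo_hi)
    return 0.5 * elo_norm + 0.3 * mmlu_gpqa + 0.2 * coding_math

# Example: Elo 1300 (normalizes to 75), MMLU/GPQA 85, HumanEval/MATH 80
print(quality_score(1300, 85, 80))  # 0.5*75 + 0.3*85 + 0.2*80 = 79.0
```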

Calculating Blended Price

To create a single metric for the "Price" axis on our scatter charts, we use a Blended Token Price.

Output tokens are generally more expensive to generate than input tokens are to process, but conversational workloads usually involve far more input tokens (context) than output tokens (generation). To balance these two factors, we standardise on a 3:1 input-to-output ratio.

Blended Price = (Input Price × 3 + Output Price × 1) ÷ 4
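The formula translates directly to code; the example prices are hypothetical:

```python
def blended_price(input_price: float, output_price: float) -> float:
    """Blended USD per million tokens, weighting input:output at 3:1."""
    return (input_price * 3 + output_price * 1) / 4

# Example: $0.50/M input, $1.50/M output
print(blended_price(0.50, 1.50))  # 0.75
```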

KEY DEFINITIONS

Context Window

Maximum number of combined input and output tokens the model can handle. The output-token limit is commonly much lower and varies by model.

Output Speed

Tokens per second received while the model is generating, measured after the first chunk has been received from the API (for models that support streaming).

Latency (TTFT)

Time to first token received, in seconds, after the API request is sent. For reasoning models, this is the first reasoning token.
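A rough sketch of how these two metrics could be measured from a streaming response. The chunk iterable here is a hypothetical stand-in for a real streaming API, and the speed calculation is a simple approximation:

```python
import time

def measure_stream(chunks):
    """Measure latency (TTFT, seconds) and output speed (tokens/s) from an
    iterable yielding per-chunk token counts. Approximation: output speed is
    total tokens divided by the time elapsed after the first chunk arrived."""
    start = time.monotonic()
    ttft = None
    total_tokens = 0
    for token_count in chunks:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start  # time to first token
        total_tokens += token_count
    elapsed = time.monotonic() - start
    generation_time = elapsed - (ttft or 0)
    speed = total_tokens / generation_time if generation_time > 0 else float("inf")
    return ttft, speed
```

In practice a benchmark would run this against a live API and average across many requests; this sketch only shows where the two timestamps come from.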

Price

Price per token, represented as USD per million tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Output Price

Price per token generated by the model (received from the API), represented as USD per million tokens.

Input Price

Price per token included in the request/message sent to the API, represented as USD per million tokens.
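Putting the two prices together, the cost of a single request can be estimated as follows (a minimal sketch; the token counts and prices are hypothetical):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """USD cost of one request, given prices in USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Example: 3,000 input tokens at $0.50/M plus 1,000 output tokens at $1.50/M
print(request_cost(3000, 1000, 0.50, 1.50))  # 0.003
```

Note that this example uses the 3:1 ratio, so the same answer falls out of the blended price: 4,000 tokens × $0.75/M = $0.003.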

FREQUENTLY ASKED QUESTIONS

Which model is the best?

Based on our Quality Score, the top-ranked model changes as providers release updates. Currently, reasoning-focused models from OpenAI and DeepSeek score the highest on overall intelligence benchmarks.

Which model is the fastest?

Lightweight models like Llama 3.1 8B and Gemini 2.0 Flash typically achieve the highest tokens-per-second rates, often exceeding 150-200 t/s depending on the API provider.

Which model is the cheapest?

Open-source models served through competitive providers offer the best price per token. Models like Gemini 1.5 Flash and DeepSeek-V3 provide excellent quality-to-price ratios.

Which open-weights model is the best?

DeepSeek-R1 currently leads among open-weights models by Quality Score, followed closely by Llama 3.1 405B and DeepSeek-V3.

How do I filter results by provider?

Use the provider filter tabs above the leaderboard table to narrow results by provider. You can also sort any column by clicking its header.

How do I see detailed results for a model?

Click on any model name in the leaderboard to visit its dedicated page with detailed benchmark results, pricing breakdowns, and speed measurements across different providers.