PLCC, SROCC, KRCC & RMSE
Four metrics that tell you how well a quality model agrees with human perception — and why no single one is enough. Built from first principles, with interactive visualizations.
You've trained a No-Reference Image Quality Assessment (NR-IQA) model that outputs a predicted quality score for each image. A human study gives you a Mean Opinion Score (MOS) for the same images. How do you measure how good your model is?
The answer is not just "compute the error." You need to know: does the model rank images in the same order humans do? Is the relationship linear, or just monotonic? How sensitive is the evaluation to a few outlier images? Four metrics — PLCC, SROCC, KRCC, and RMSE — each answer a different version of this question.
1. The Evaluation Setup
Let $\{x_i\}_{i=1}^n$ be the model's predicted quality scores and $\{y_i\}_{i=1}^n$ be the corresponding human MOS values. These are paired: $x_i$ and $y_i$ describe the same image.
A perfect model would give $x_i = y_i$ for all $i$. In practice there's always a gap — the question is how to quantify it. Three important caveats apply before computing metrics:
Model predictions often live on a different scale or have a nonlinear relationship with MOS. The standard fix (per VQEG / ITU-R Rec. BT.500) is to first fit a monotonic nonlinear regression to map predictions to the MOS scale:
$\hat{y}_i = f(x_i) \quad$ where $\;f(x) = \dfrac{\beta_1 - \beta_2}{1 + e^{-\beta_3(x - \beta_4)}} + \beta_2$
Fitted values $\hat{y}_i$ replace raw predictions for PLCC and RMSE. For SROCC and KRCC no fitting is needed: they are rank-based and therefore invariant to any strictly increasing rescaling of the predictions.
What fitting removes: global scale, offset, and monotonic nonlinearity between predictions and MOS (e.g., a model outputting [0,1] vs MOS on [1,5], or a logarithmic relationship). After fitting, predictions are on the MOS scale and RMSE is meaningful in MOS units.
What fitting cannot remove: per-image random errors, i.e. noise that is not a systematic function of the predicted score. If individual predictions scatter widely around MOS even after the best monotonic mapping, RMSE will still be high. PLCC can remain high at the same time when the MOS values span a wide range, because PLCC normalizes the scatter by the overall spread of the scores, whereas RMSE reports it in absolute MOS units. This is why RMSE should always accompany PLCC.
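The invariance of the rank metrics can be checked on a small example. The sketch below uses made-up scores for nine hypothetical images and warps the predictions with a strictly increasing exponential; `pred`, `mos`, and `warped` are illustrative names, not part of any dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: 9 images, predictions on [0, 1], MOS on [1, 5].
pred = np.linspace(0.1, 0.9, 9)
mos = np.array([1.2, 1.0, 2.1, 2.5, 2.4, 3.6, 3.9, 4.8, 4.5])

# A strictly increasing but strongly nonlinear rescaling of the predictions.
warped = np.exp(5.0 * pred)

srocc_raw, _ = stats.spearmanr(pred, mos)
srocc_warped, _ = stats.spearmanr(warped, mos)
krcc_raw, _ = stats.kendalltau(pred, mos)
krcc_warped, _ = stats.kendalltau(warped, mos)
plcc_raw = np.corrcoef(pred, mos)[0, 1]
plcc_warped = np.corrcoef(warped, mos)[0, 1]

print(f"SROCC raw={srocc_raw:.4f}  warped={srocc_warped:.4f}")  # identical
print(f"KRCC  raw={krcc_raw:.4f}  warped={krcc_warped:.4f}")    # identical
print(f"PLCC  raw={plcc_raw:.4f}  warped={plcc_warped:.4f}")    # differ
```

The ranks of `warped` are the same as the ranks of `pred`, so SROCC and KRCC do not move, while the unfitted PLCC changes. That gap is what the logistic fit is meant to close.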
2. PLCC — Pearson Linear Correlation Coefficient
PLCC measures the strength of the linear relationship between predicted scores and MOS. It is identical to the classical Pearson correlation coefficient $r$.
2.1 The Formula
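$\text{PLCC} = \dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$
where $\bar{x} = \tfrac{1}{n}\sum x_i$ and $\bar{y} = \tfrac{1}{n}\sum y_i$ are the means. PLCC $\in [-1, 1]$. Values near $+1$ mean near-perfect positive linear agreement; near $0$ means no linear trend.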
2.2 Geometric Interpretation — Cosine Similarity
Define centered vectors $\mathbf{a} = (x_1-\bar{x},\ldots,x_n-\bar{x})$ and $\mathbf{b} = (y_1-\bar{y},\ldots,y_n-\bar{y})$. Then:
$\text{PLCC} = \dfrac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|} = \cos\theta$
PLCC is the cosine of the angle between mean-centered prediction and MOS vectors in $\mathbb{R}^n$. When the two vectors point in the same direction ($\theta=0$), PLCC $=1$. Orthogonal vectors ($\theta=90^\circ$) give PLCC $=0$. Opposite directions give $-1$.
Draw a vertical line at $\bar{x}$ and a horizontal line at $\bar{y}$ on the scatter plot. Each point then falls in one of four quadrants:
• Top-right ($x{>}\bar{x},y{>}\bar{y}$) and bottom-left ($x{<}\bar{x},y{<}\bar{y}$): product $(x_i-\bar{x})(y_i-\bar{y}) > 0$ → positive contribution to PLCC.
• Top-left and bottom-right: product $< 0$ → negative contribution. The balance between the positive and negative products determines the sign of PLCC; dividing by the spread of $x$ and $y$ sets its magnitude.
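Both views of PLCC, the cosine of centered vectors and the sum of signed quadrant products, can be checked numerically. The following is a minimal sketch on five made-up score pairs; the arrays are hypothetical and chosen only for illustration.

```python
import numpy as np

pred = np.array([0.2, 0.4, 0.5, 0.7, 0.9])   # hypothetical predictions
mos = np.array([1.5, 2.0, 3.5, 3.0, 4.5])    # hypothetical MOS values

# Mean-centered vectors a and b.
a = pred - pred.mean()
b = mos - mos.mean()

# Cosine of the angle between the centered vectors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
pearson = np.corrcoef(pred, mos)[0, 1]

# Quadrant view: each point contributes the signed product a_i * b_i.
products = a * b
positive_sum = products[products > 0].sum()   # top-right and bottom-left
negative_sum = products[products < 0].sum()   # top-left and bottom-right

print(f"cosine={cosine:.6f}  pearson={pearson:.6f}")
print(f"positive={positive_sum:.4f}  negative={negative_sum:.4f}")
```

The numerator of PLCC is exactly `positive_sum + negative_sum`, so the sign of the coefficient is decided by which group of quadrants dominates.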
3. SROCC — Spearman Rank-Order Correlation Coefficient
SROCC replaces each value by its rank within the sample, then computes the Pearson correlation on the ranks. It measures monotonic agreement rather than strictly linear agreement.
3.1 The Rank Transform
Given $\{x_i\}$, replace each $x_i$ with its rank $r_i = \text{rank}(x_i)$: the smallest value gets rank 1, the largest gets rank $n$ (ties get the average of their ranks). Do the same for $\{y_i\}$ to get $\{s_i\}$.
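As a small illustration of the rank transform with ties, `scipy.stats.rankdata` with `method="average"` follows the convention described above. The four scores are made up.

```python
import numpy as np
from scipy.stats import rankdata

scores = np.array([0.3, 0.9, 0.3, 0.5])   # hypothetical scores with one tie

# The two 0.3 values would occupy ranks 1 and 2, so each gets (1 + 2) / 2.
ranks = rankdata(scores, method="average")
print(ranks.tolist())  # [1.5, 4.0, 1.5, 3.0]
```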
3.2 Formula and Relation to PLCC
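$\text{SROCC} = \dfrac{\sum_{i=1}^{n}(r_i-\bar{r})(s_i-\bar{s})}{\sqrt{\sum_{i=1}^{n}(r_i-\bar{r})^2}\,\sqrt{\sum_{i=1}^{n}(s_i-\bar{s})^2}}$
When there are no ties, this simplifies to the well-known shortcut formula:
$\text{SROCC} = 1 - \dfrac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}, \quad d_i = r_i - s_i$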
where $d_i$ is the difference between the rank of image $i$ in the predicted list and its rank in the MOS list. SROCC = PLCC applied to ranks.
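The equivalence between the rank-difference shortcut, Pearson correlation on ranks, and a library implementation can be verified on tie-free data. This is a sketch with six made-up score pairs; only one adjacent pair of images is swapped between the two rankings.

```python
import numpy as np
from scipy import stats

pred = np.array([0.31, 0.52, 0.48, 0.77, 0.90, 0.12])  # hypothetical predictions
mos = np.array([2.1, 2.9, 3.4, 4.0, 4.6, 1.3])         # hypothetical MOS values
n = len(pred)

r = stats.rankdata(pred)   # ranks of predictions: [2, 4, 3, 5, 6, 1]
s = stats.rankdata(mos)    # ranks of MOS:         [2, 3, 4, 5, 6, 1]
d = r - s                  # rank differences:     [0, 1, -1, 0, 0, 0]

shortcut = 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))  # 1 - 12/210
pearson_on_ranks = np.corrcoef(r, s)[0, 1]
scipy_rho, _ = stats.spearmanr(pred, mos)

print(f"{shortcut:.4f} {pearson_on_ranks:.4f} {scipy_rho:.4f}")  # all 0.9429
```

With ties the shortcut is no longer exact, which is one reason to rely on the Pearson-on-ranks definition in practice.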
Consider an outlier image with a predicted score of 0.02 while all others lie in [0.5, 1.0]. In raw PLCC this point has huge leverage — it pulls the regression line and inflates/deflates the correlation. In SROCC, it simply gets rank 1 (same as any bottom-ranked image). Its numerical extremity is compressed to just one rank step.
This is why SROCC is the most commonly reported single metric in IQA/VQA papers — it is robust to scale differences, outliers, and non-linear but monotonic mappings between predictions and MOS.
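The outlier scenario can be reproduced with a deterministic toy set. In the sketch below the predictions start out perfectly linear in MOS, and then the lowest-MOS image is given the extreme prediction 0.02. The data are synthetic, and PLCC is computed on raw predictions without logistic fitting, as in the paragraph above.

```python
import numpy as np
from scipy import stats

mos = np.linspace(1.0, 5.0, 20)     # synthetic MOS values
pred = np.linspace(0.5, 1.0, 20)    # predictions exactly linear in MOS

# The lowest-MOS image gets an extreme prediction. It was already ranked
# last, so its rank (1) does not change.
pred_outlier = pred.copy()
pred_outlier[0] = 0.02

plcc_clean = np.corrcoef(pred, mos)[0, 1]
plcc_outlier = np.corrcoef(pred_outlier, mos)[0, 1]
srocc_clean, _ = stats.spearmanr(pred, mos)
srocc_outlier, _ = stats.spearmanr(pred_outlier, mos)

print(f"PLCC  clean={plcc_clean:.4f}  with outlier={plcc_outlier:.4f}")
print(f"SROCC clean={srocc_clean:.4f}  with outlier={srocc_outlier:.4f}")
```

A single point is enough to pull the raw PLCC noticeably below 1, while SROCC stays at 1 because the ordering is untouched.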
4. KRCC — Kendall Rank Correlation Coefficient
KRCC (Kendall's $\tau$) counts how many pairs of images are ranked consistently by the model and by humans. It has a direct probability interpretation that SROCC lacks.
4.1 Concordant and Discordant Pairs
For every unordered pair $(i,j)$ with $i < j$:
- Concordant: both $x$ and $y$ go in the same direction — $(x_i > x_j$ and $y_i > y_j)$ or $(x_i < x_j$ and $y_i < y_j)$. The model agrees with humans on this pair.
- Discordant: opposite directions — $(x_i > x_j$ and $y_i < y_j)$ or vice versa. Model disagrees with humans.
- Tied: $x_i = x_j$ or $y_i = y_j$ — excluded from counts.
4.2 Formula and Probability Interpretation
$\tau = \dfrac{N_c - N_d}{\binom{n}{2}}$
where $N_c$ = number of concordant pairs and $N_d$ = number of discordant pairs. The denominator $\binom{n}{2} = n(n-1)/2$ is the total number of distinct pairs.
$\tau = P(\text{concordant}) - P(\text{discordant})$ for a pair of images chosen uniformly at random. When there are no ties, every pair is either concordant or discordant, so $P(\text{concordant}) = (1+\tau)/2$. For example, $\tau = 0.7$ means that for two randomly picked images the model ranks them the same way as humans with probability $(1+0.7)/2 = 0.85$.
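The pair counting can be written out directly. The sketch below is a brute-force $O(n^2)$ version for illustration only; `kendall_tau_a` is a hypothetical helper name, and the data are made up and tie-free. Note that `scipy.stats.kendalltau` computes the tie-corrected tau-b by default, which coincides with this count-based definition when there are no ties.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def kendall_tau_a(x, y):
    """Brute-force tau: (concordant - discordant) / C(n, 2)."""
    n_c = n_d = 0
    for i, j in combinations(range(len(x)), 2):
        sign = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
        if sign > 0:
            n_c += 1        # model and humans order the pair the same way
        elif sign < 0:
            n_d += 1        # model and humans disagree on the pair
        # sign == 0 means a tie in x or y: counted in neither group
    n_pairs = len(x) * (len(x) - 1) // 2
    return (n_c - n_d) / n_pairs, n_c, n_d


pred = [0.31, 0.52, 0.48, 0.77, 0.90, 0.12]  # hypothetical predictions
mos = [2.1, 2.9, 3.4, 4.0, 4.6, 1.3]         # hypothetical MOS values

tau, n_c, n_d = kendall_tau_a(pred, mos)
scipy_tau, _ = stats.kendalltau(pred, mos)
p_agree = n_c / (n_c + n_d)

print(n_c, n_d)                    # 14 1
print(f"{tau:.4f} {scipy_tau:.4f}")  # 0.8667 0.8667
print(f"{p_agree:.4f}")            # 0.9333, which equals (1 + tau) / 2
```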
5. RMSE — Root Mean Square Error
RMSE measures the average magnitude of prediction errors in the same units as MOS. Unlike the correlation metrics, it is not scale-invariant.
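$\text{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$
where $\hat{y}_i = f(x_i)$ is the logistic-fitted prediction. RMSE $\in [0, \infty)$. Lower is better. If MOS is on a scale of 1–5, RMSE $= 0.3$ means the typical prediction error is about $0.3$ MOS points in the root-mean-square sense, which weights large errors more heavily than a simple mean absolute error.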
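The difference between an RMS error and a simple average error shows up as soon as one error is much larger than the rest. A minimal sketch with four made-up fitted predictions (`fitted` stands in for already-fitted values $\hat{y}_i$):

```python
import numpy as np

mos = np.array([3.0, 3.0, 3.0, 3.0])      # hypothetical MOS values
fitted = np.array([3.1, 3.1, 3.1, 3.9])   # hypothetical fitted predictions

errors = fitted - mos                      # three errors of 0.1, one of 0.9
mae = np.mean(np.abs(errors))              # (0.1 + 0.1 + 0.1 + 0.9) / 4 = 0.3
rmse = np.sqrt(np.mean(errors ** 2))       # sqrt((0.03 + 0.81) / 4) = sqrt(0.21)

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}")   # MAE=0.300  RMSE=0.458
```

The single large miss dominates the RMSE, which is why RMSE is more sensitive to a few badly predicted images than a mean absolute error would be.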
6. Failure Cases — When Metrics Disagree
The most important reason to report all four metrics is that they can disagree significantly — and each disagreement reveals a specific failure mode of the model.
6.1 Global Positive, Local Negative
The most insidious failure case: a model shows high PLCC/SROCC overall, but within a specific quality range the model's ordering is inverted relative to human perception. For example:
- High-distortion images (MOS 1–2): model correctly ranks these low.
- Low-distortion images (MOS 4–5): model correctly ranks these high.
- Medium-distortion images (MOS 2–4): model is locally wrong — images it ranks higher in this range are actually perceived as lower quality.
PLCC is dominated by the large dynamic range between the extreme clusters, so it stays high. SROCC is partially sensitive but also gets pulled by the extremes. KRCC is most likely to catch this, because it counts every discordant pair equally regardless of where in the quality range the inversion occurs.
Many NR-IQA models are trained on a mix of content types (natural scenes, screen content, synthetic distortions). The model learns the dominant trend (heavily distorted = bad) but fails on subtle quality differences in the medium range — which is exactly where perceptual quality discrimination matters most to users.
A model with PLCC=0.93 and SROCC=0.91 can still have negative local correlation in a specific distortion type or quality range. Always evaluate on subsets of the data by content type and distortion level.
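A deterministic toy construction makes the effect concrete. In this sketch the low and high clusters are predicted perfectly, while the ordering inside the middle cluster is fully reversed. The data are synthetic, the middle band is narrowed to MOS 2.5 to 3.5 so the clusters stay cleanly separated, and no logistic fitting is applied.

```python
import numpy as np
from scipy import stats

low = np.linspace(1.0, 2.0, 30)    # heavily distorted images
mid = np.linspace(2.5, 3.5, 30)    # medium-quality images
high = np.linspace(4.0, 5.0, 30)   # near-pristine images

mos = np.concatenate([low, mid, high])
# Extremes are predicted exactly; the middle band is inverted (6 - MOS
# maps 2.5 -> 3.5 and 3.5 -> 2.5, so it stays inside the same band).
pred = np.concatenate([low, 6.0 - mid, high])

plcc_all = np.corrcoef(pred, mos)[0, 1]
srocc_all, _ = stats.spearmanr(pred, mos)
krcc_all, _ = stats.kendalltau(pred, mos)

# Same metric restricted to the medium-quality subset.
in_mid = (mos >= 2.5) & (mos <= 3.5)
srocc_mid, _ = stats.spearmanr(pred[in_mid], mos[in_mid])

print(f"overall  PLCC={plcc_all:.3f}  SROCC={srocc_all:.3f}  KRCC={krcc_all:.3f}")
print(f"mid-band SROCC={srocc_mid:.3f}")
```

Overall PLCC and SROCC stay above 0.9 even though the mid-band SROCC is -1, and KRCC drops the furthest of the three global metrics. Only the per-subset evaluation exposes the inversion directly.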
6.2 High PLCC with Low SROCC (Non-Monotonic but Globally Linear)
If predicted scores and MOS have a strong global linear trend but many local rank inversions, PLCC will be high and SROCC will be lower. This typically happens when the model output is bimodal (e.g., separating compressed vs uncompressed well) but noisy within each mode.
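The bimodal case can be sketched with an extreme version of "noisy within each mode": two well-separated clusters whose internal ordering is completely reversed. The data are synthetic and the metrics are computed on raw predictions; a real model would fall somewhere between this worst case and a clean ranking.

```python
import numpy as np
from scipy import stats

m = 25
low_mos = np.linspace(1.0, 1.5, m)    # e.g. heavily compressed images
high_mos = np.linspace(4.5, 5.0, m)   # e.g. uncompressed images
mos = np.concatenate([low_mos, high_mos])

# The model separates the two modes, but reverses the order inside each one.
pred = np.concatenate([2.5 - low_mos, 9.5 - high_mos])

plcc = np.corrcoef(pred, mos)[0, 1]
srocc, _ = stats.spearmanr(pred, mos)
krcc, _ = stats.kendalltau(pred, mos)

print(f"PLCC={plcc:.3f}  SROCC={srocc:.3f}  KRCC={krcc:.3f}")
```

The gap between the modes dominates the variance, so PLCC stays close to 1, while SROCC falls to roughly 0.5 and KRCC to nearly 0 because every within-mode pair is discordant.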
6.3 Low RMSE with Low PLCC (Predicting the Mean)
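A degenerate model that always predicts $\hat{y} = \bar{y}$ achieves an RMSE equal to the (population) standard deviation of MOS, which can be numerically small when the dataset covers a narrow quality range. Its PLCC, however, is undefined (the predictions have zero variance, so the formula reduces to 0/0) and is conventionally treated as 0. This is why RMSE alone is never sufficient.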
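The constant predictor is easy to try out. In the sketch below the MOS values are made up so that their mean is exactly 3.0; NumPy reports the correlation of a zero-variance input as NaN, which is usually read as "no correlation".

```python
import numpy as np

mos = np.array([2.5, 3.0, 3.5, 3.0, 2.0, 4.0])   # hypothetical MOS, mean 3.0
constant_pred = np.full_like(mos, mos.mean())    # always predict the mean

rmse = float(np.sqrt(np.mean((constant_pred - mos) ** 2)))

# Zero variance in the predictions makes the correlation 0/0; silence the
# resulting floating-point warning and inspect the value instead.
with np.errstate(invalid="ignore", divide="ignore"):
    plcc = float(np.corrcoef(constant_pred, mos)[0, 1])

print(f"RMSE={rmse:.4f}  std(MOS)={mos.std():.4f}")  # the two values match
print(f"PLCC={plcc}")                                # nan
```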
6.4 Summary Table
| Metric | Range | Measures | Needs fitting? | Robust to outliers? | In IQA papers |
|---|---|---|---|---|---|
| PLCC | [-1, 1] | Linear correlation | Yes | Low | Always reported |
| SROCC | [-1, 1] | Monotonic ordering | No | High | Primary metric |
| KRCC | [-1, 1] | Pairwise concordance | No | High | Often reported |
| RMSE | [0, ∞) | Absolute error | Yes | Moderate | Always reported |
```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit


# ── Nonlinear logistic regression (VQEG standard 4-parameter fit) ────────────
def logistic_fit(x, b1, b2, b3, b4):
    """4-parameter logistic: maps predicted scores to MOS scale."""
    return (b1 - b2) / (1 + np.exp(-b3 * (x - b4))) + b2


def fit_predictions(predicted: np.ndarray, mos: np.ndarray) -> np.ndarray:
    """Fit a 4-parameter logistic to map predicted → MOS scale."""
    # Initial guess: asymptotes from the MOS range, unit slope,
    # midpoint at the mean prediction.
    b0 = [np.max(mos), np.min(mos), 1.0, np.mean(predicted)]
    try:
        popt, _ = curve_fit(logistic_fit, predicted, mos, p0=b0, maxfev=10_000)
        return logistic_fit(predicted, *popt)
    except RuntimeError:
        # Fallback when the optimizer does not converge: linear rescaling
        a = np.polyfit(predicted, mos, 1)
        return np.polyval(a, predicted)


# ── Core metrics ─────────────────────────────────────────────────────────────
def plcc(predicted: np.ndarray, mos: np.ndarray) -> float:
    """Pearson Linear Correlation Coefficient (after logistic fitting)."""
    fitted = fit_predictions(predicted, mos)
    return float(np.corrcoef(fitted, mos)[0, 1])


def srocc(predicted: np.ndarray, mos: np.ndarray) -> float:
    """Spearman Rank-Order Correlation Coefficient (no fitting needed)."""
    rho, _ = stats.spearmanr(predicted, mos)
    return float(rho)


def krcc(predicted: np.ndarray, mos: np.ndarray) -> float:
    """Kendall Rank Correlation Coefficient (no fitting needed)."""
    tau, _ = stats.kendalltau(predicted, mos)
    return float(tau)


def rmse(predicted: np.ndarray, mos: np.ndarray) -> float:
    """Root Mean Square Error (after logistic fitting)."""
    fitted = fit_predictions(predicted, mos)
    return float(np.sqrt(np.mean((fitted - mos) ** 2)))


# ── Evaluate all metrics at once ─────────────────────────────────────────────
def evaluate(predicted: np.ndarray, mos: np.ndarray) -> dict:
    return {
        "PLCC": plcc(predicted, mos),
        "SROCC": srocc(predicted, mos),
        "KRCC": krcc(predicted, mos),
        "RMSE": rmse(predicted, mos),
    }


# ── Example: simulate a model evaluation ────────────────────────────────────
rng = np.random.default_rng(42)
n = 200
mos_scores = rng.uniform(1, 5, n)                         # human MOS [1,5]
pred_scores = 0.85 * mos_scores + rng.normal(0, 0.4, n)   # noisy predictions

metrics = evaluate(pred_scores, mos_scores)
for k, v in metrics.items():
    print(f"{k:6s}: {v:.4f}")
# Prints one line per metric. Exact values depend on the random draw; for this
# simulation expect PLCC and SROCC in the low 0.9s, a noticeably lower KRCC,
# and an RMSE of roughly 0.4 MOS points.
```
PLCC = Pearson r = cosine similarity of mean-centered score vectors. Measures linear agreement; requires logistic fitting; sensitive to outliers.
SROCC = Pearson r on ranks. Measures monotonic ordering; no fitting required; robust to outliers; the standard IQA primary metric.
KRCC = (concordant − discordant pairs) / total pairs. Direct probability interpretation; typically the smallest in absolute value of the three correlation metrics on the same data; often preferred for significance testing.
RMSE = RMS error in MOS units after fitting. Measures absolute accuracy; reports the size of the residual per-image error that a correlation value alone does not convey; always report alongside PLCC.
No single metric tells the full story. A responsible IQA paper always reports all four — and ideally also reports per-content-type and per-distortion-level breakdowns to catch local inversions.