Statistics · Visual Guide · IQA/VQA

PLCC, SROCC, KRCC & RMSE

Four metrics that tell you how well a quality model agrees with human perception, and why no single one is enough. Built from first principles.

You've trained a No-Reference Image Quality Assessment (NR-IQA) model that outputs a predicted quality score for each image. A human study gives you a Mean Opinion Score (MOS) for the same images. How do you measure how good your model is?

The answer is not just "compute the error." You need to know: does the model rank images in the same order humans do? Is the relationship linear, or just monotonic? How sensitive is the evaluation to a few outlier images? Four metrics — PLCC, SROCC, KRCC, and RMSE — each answer a different version of this question.

1. The Evaluation Setup

Let $\{x_i\}_{i=1}^n$ be the model's predicted quality scores and $\{y_i\}_{i=1}^n$ be the corresponding human MOS values. These are paired: $x_i$ and $y_i$ describe the same image.

$$\text{Dataset: } \{(x_i,\, y_i)\}_{i=1}^n \quad x_i = \text{predicted quality},\quad y_i = \text{human MOS}$$

A perfect model would give $x_i = y_i$ for all $i$. In practice there's always a gap — the question is how to quantify it. Three important caveats apply before computing metrics:

Required Preprocessing — Nonlinear Regression Fitting

Model predictions often live on a different scale or have a nonlinear relationship with MOS. The standard fix (following VQEG practice and ITU-R Rec. BT.500) is to first fit a monotonic nonlinear regression that maps predictions to the MOS scale:

$\hat{y}_i = f(x_i) \quad$ where $\;f(x) = \dfrac{\beta_1 - \beta_2}{1 + e^{-\beta_3(x - \beta_4)}} + \beta_2$

Fitted values $\hat{y}_i$ replace raw predictions for PLCC and RMSE. For SROCC and KRCC no fitting is needed — they are rank-based and invariant to any monotonic rescaling.

What fitting removes: global scale, offset, and monotonic nonlinearity between predictions and MOS (e.g., a model outputting [0,1] vs MOS on [1,5], or a logarithmic relationship). After fitting, predictions are on the MOS scale and RMSE is meaningful in MOS units.

What fitting cannot remove: per-image random errors, i.e. noise that is not a systematic function of the predicted score. If individual predictions scatter widely around MOS even after the best monotonic mapping, RMSE will still be high. PLCC can remain high at the same time because it is a relative measure: it expresses the residual scatter as a fraction of the total MOS spread, so a dataset with a wide MOS range can show a high PLCC while the error in MOS units is still large. This is precisely why RMSE must always accompany PLCC.

2. PLCC — Pearson Linear Correlation Coefficient

PLCC measures the strength of the linear relationship between predicted scores and MOS. It is identical to the classical Pearson correlation coefficient $r$.

2.1 The Formula

$$\text{PLCC} = r = \frac{\displaystyle\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\displaystyle\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\;\cdot\;\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$$

where $\bar{x} = \tfrac{1}{n}\sum x_i$ and $\bar{y} = \tfrac{1}{n}\sum y_i$ are the means. PLCC $\in [-1, 1]$. Values near $+1$ mean near-perfect positive linear agreement; near $0$ means no linear trend.

2.2 Geometric Interpretation — Cosine Similarity

Define centered vectors $\mathbf{a} = (x_1-\bar{x},\ldots,x_n-\bar{x})$ and $\mathbf{b} = (y_1-\bar{y},\ldots,y_n-\bar{y})$. Then:

$$\text{PLCC} = \cos\angle(\mathbf{a},\mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$$

PLCC is the cosine of the angle between mean-centered prediction and MOS vectors in $\mathbb{R}^n$. When the two vectors point in the same direction ($\theta=0$), PLCC $=1$. Orthogonal vectors ($\theta=90°$) give PLCC $=0$. Opposite directions give $-1$.
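The cosine identity is easy to confirm numerically. The following is a minimal sketch on made-up scores (the arrays `pred` and `mos` are illustrative values, not data from any study); it compares the cosine of the centered vectors with NumPy's own Pearson implementation.

```python
import numpy as np

# Hypothetical toy scores (illustrative values only)
pred = np.array([0.2, 0.4, 0.5, 0.7, 0.9])
mos = np.array([1.5, 2.1, 3.0, 3.8, 4.6])

# Centre both vectors on their means
a = pred - pred.mean()
b = mos - mos.mean()

# Cosine of the angle between the centred vectors
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pearson r as computed by NumPy
pearson = float(np.corrcoef(pred, mos)[0, 1])

print(f"cosine of centred vectors: {cosine:.6f}")
print(f"np.corrcoef:               {pearson:.6f}")
```

The two printed numbers should agree to floating-point precision, since both evaluate the same expression.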

Four-Quadrant Intuition

Draw a vertical line at $\bar{x}$ and a horizontal line at $\bar{y}$ on the scatter plot. Each point falls in one of four quadrants:

• Top-right & bottom-left ($x{>}\bar{x},y{>}\bar{y}$) or ($x{<}\bar{x},y{<}\bar{y}$): product $(x_i-\bar{x})(y_i-\bar{y}) > 0$ → positive contribution to PLCC.

• Top-left & bottom-right: product $< 0$ → negative contribution. The sum of these signed products, normalized by the spreads of $x$ and $y$, determines the sign and magnitude of PLCC.

Can PLCC be high even if absolute errors are large?

Yes, but only for errors that logistic fitting cannot eliminate. Fitting removes systematic distortions (scale, offset, nonlinearity). What it cannot remove is per-image random noise. Consider a model with PLCC = 0.92: it explains about 85% of MOS variance ($R^2 = 0.92^2 \approx 0.85$). If the MOS standard deviation is around 1.0 on a 1–5 scale, the remaining 15% of unexplained variance corresponds to RMSE ≈ 0.4, nearly half a quality level off per image in the RMS sense. PLCC reports that scatter only relative to the MOS spread, whereas RMSE reports it directly in MOS units. This is why both are always reported together.
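The link between PLCC and RMSE can be written out explicitly. As a sketch, assume the fitted predictions $\hat{y}_i$ come from a least-squares fit whose residuals have zero mean and are uncorrelated with $\hat{y}_i$ (this holds for a linear fit, and for the 4-parameter logistic at a converged optimum), and let $\sigma_y$ be the standard deviation of MOS. The value $\sigma_y = 1.0$ below is an assumed illustration, not a measured quantity.

$$\text{RMSE}^2 = \frac{1}{n}\sum_{i=1}^n (\hat{y}_i - y_i)^2 = \sigma_y^2\,(1 - r^2) \quad\Longrightarrow\quad \text{RMSE} = \sigma_y\sqrt{1 - r^2}$$

$$r = 0.92,\ \sigma_y = 1.0:\quad \text{RMSE} = 1.0 \times \sqrt{1 - 0.8464} = \sqrt{0.1536} \approx 0.39$$

For a fixed PLCC, doubling the MOS spread doubles RMSE, which is the sense in which PLCC alone does not pin down the error in MOS units.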

What if the relationship is non-linear but monotonic?

PLCC will underestimate the true strength of the association. For example, a model whose predictions perfectly rank images but with a logarithmic rather than linear relationship might give PLCC = 0.85 even though the ordering is perfect. SROCC (which only cares about ordering) would give 1.0.
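A small experiment illustrates the effect. The sketch below assumes a hypothetical model whose scores relate to MOS through an exact logarithm (the arrays are constructed for illustration) and computes raw PLCC, without any logistic fitting, next to SROCC.

```python
import numpy as np
from scipy import stats

# Hypothetical setup: MOS is an exact logarithmic function of the model score
pred = np.linspace(1.0, 100.0, 50)
mos = 1.0 + 4.0 * np.log(pred) / np.log(100.0)   # spans the 1..5 MOS range

# Raw Pearson correlation (no logistic fitting applied)
plcc_raw = float(np.corrcoef(pred, mos)[0, 1])

# Spearman correlation only looks at the ordering
srocc, _ = stats.spearmanr(pred, mos)
srocc = float(srocc)

print(f"raw PLCC: {plcc_raw:.3f}")
print(f"SROCC:    {srocc:.3f}")
```

The ordering is perfect, so SROCC is 1, while raw PLCC falls visibly below 1 because a straight line is a poor description of a logarithmic curve.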

Why apply logistic fitting before PLCC?

Predicted scores from different models can live on completely different scales. The logistic fit maps predictions to the MOS scale monotonically before computing the linear correlation and absolute error. Without this, you'd be penalising models simply for using a different output range.

3. SROCC — Spearman Rank-Order Correlation Coefficient

SROCC replaces each value by its rank within the sample, then computes the Pearson correlation on the ranks. It measures monotonic agreement rather than strictly linear agreement.

3.1 The Rank Transform

Given $\{x_i\}$, replace each $x_i$ with its rank $r_i = \text{rank}(x_i)$: the smallest value gets rank 1, the largest gets rank $n$ (ties get the average of their ranks). Do the same for $\{y_i\}$ to get $\{s_i\}$.

3.2 Formula and Relation to PLCC

$$\text{SROCC} = \rho = r_{\text{Pearson}}(\mathbf{r}, \mathbf{s}) = \frac{\sum_i (r_i-\bar{r})(s_i-\bar{s})}{\sqrt{\sum_i(r_i-\bar{r})^2}\sqrt{\sum_i(s_i-\bar{s})^2}}$$

When there are no ties, this simplifies to the well-known shortcut formula:

$$\rho = 1 - \frac{6\,\displaystyle\sum_{i=1}^n d_i^2}{n(n^2-1)}, \quad d_i = r_i - s_i$$

where $d_i$ is the difference between the rank of image $i$ in the predicted list and its rank in the MOS list. SROCC = PLCC applied to ranks.
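The equivalence of the three routes to SROCC can be checked on a tiny tie-free example. This is a sketch with made-up scores (`pred` and `mos` are illustrative); it uses `scipy.stats.rankdata` for the rank transform and `scipy.stats.spearmanr` as the reference.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for six images (no ties)
pred = np.array([0.31, 0.75, 0.52, 0.90, 0.12, 0.66])
mos = np.array([2.0, 3.9, 3.1, 4.4, 1.2, 4.1])

# Rank transform: smallest value gets rank 1, ties would get average ranks
r = stats.rankdata(pred)
s = stats.rankdata(mos)

# Route 1: Pearson correlation applied to the ranks
rho_pearson_on_ranks = float(np.corrcoef(r, s)[0, 1])

# Route 2: shortcut formula (valid only without ties)
d = r - s
n = len(pred)
rho_shortcut = float(1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1)))

# Route 3: SciPy's implementation
rho_scipy, _ = stats.spearmanr(pred, mos)
rho_scipy = float(rho_scipy)

print(f"{rho_pearson_on_ranks:.4f} {rho_shortcut:.4f} {rho_scipy:.4f}")
# prints: 0.9429 0.9429 0.9429
```

Here only two images swap adjacent ranks, so $\sum d_i^2 = 2$ and $\rho = 1 - 12/210$.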

Why SROCC is More Robust Than PLCC

Consider an outlier image with a predicted score of 0.02 while all others lie in [0.5, 1.0]. In raw PLCC this point has huge leverage — it pulls the regression line and inflates/deflates the correlation. In SROCC, it simply gets rank 1 (same as any bottom-ranked image). Its numerical extremity is compressed to just one rank step.

This is why SROCC is the most commonly reported single metric in IQA/VQA papers — it is robust to scale differences, outliers, and non-linear but monotonic mappings between predictions and MOS.
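The leverage effect can be reproduced with a constructed example. The sketch below assumes 20 hypothetical images whose predictions track MOS exactly, then appends one image that the model scores at 0.02 although its MOS is 4.5; PLCC here is the raw Pearson correlation, with no logistic fitting.

```python
import numpy as np
from scipy import stats

def raw_plcc(x, y):
    """Pearson correlation on raw scores (no fitting)."""
    return float(np.corrcoef(x, y)[0, 1])

def srocc(x, y):
    """Spearman rank-order correlation."""
    rho, _ = stats.spearmanr(x, y)
    return float(rho)

# Hypothetical clean data: predictions in [0.5, 1.0] that map linearly to MOS 1..5
pred = np.linspace(0.5, 1.0, 20)
mos = 8.0 * pred - 3.0

# One outlier: predicted 0.02 while humans rate the image 4.5
pred_out = np.append(pred, 0.02)
mos_out = np.append(mos, 4.5)

plcc_before, srocc_before = raw_plcc(pred, mos), srocc(pred, mos)
plcc_after, srocc_after = raw_plcc(pred_out, mos_out), srocc(pred_out, mos_out)

print(f"without outlier: PLCC={plcc_before:.3f}  SROCC={srocc_before:.3f}")
print(f"with outlier:    PLCC={plcc_after:.3f}  SROCC={srocc_after:.3f}")
```

A single point is enough to cut raw PLCC roughly in half in this setup, while SROCC drops far less because the outlier can only be wrong by a bounded number of rank positions.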

Does SROCC require logistic regression fitting?

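No. The rank transform absorbs any strictly increasing rescaling of the data. If you apply a strictly increasing function $g$ to all predictions (e.g., take their logarithm, or square them when all scores are non-negative), the ranks, and hence SROCC, are unchanged. A strictly decreasing $g$ reverses the ranks and flips the sign of SROCC. Logistic fitting is only needed for PLCC and RMSE.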
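The invariance can be checked numerically. The sketch below applies a few strictly increasing transforms to hypothetical positive predictions (simulated with a fixed seed; the generating formula is an assumption for illustration) and recomputes raw PLCC, SROCC and KRCC each time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mos = rng.uniform(1, 5, 100)
# Hypothetical predictions on a strictly positive scale so that log() is defined
pred = 0.1 + 0.2 * mos + rng.uniform(0.0, 0.3, 100)

# Strictly increasing transforms on positive inputs
transforms = {
    "identity": lambda v: v,
    "log": np.log,
    "cube": lambda v: v ** 3,
    "exp": np.exp,
}

results = {}
for name, g in transforms.items():
    gx = g(pred)
    r = float(np.corrcoef(gx, mos)[0, 1])      # raw PLCC, changes with g
    rho, _ = stats.spearmanr(gx, mos)          # rank-based, unchanged
    tau, _ = stats.kendalltau(gx, mos)         # rank-based, unchanged
    results[name] = (r, float(rho), float(tau))
    print(f"{name:8s} PLCC={r:.4f}  SROCC={rho:.4f}  KRCC={tau:.4f}")
```

The SROCC and KRCC columns are identical across rows, while the raw PLCC column moves with the choice of transform.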

When can PLCC be high but SROCC low?

When the relationship is strongly linear globally but has many local inversions. For example, if most images follow a strong linear trend but a large cluster of medium-quality images is consistently misranked by the model, PLCC (dominated by global linear trend) will stay high while SROCC (sensitive to rank inversions) drops.

SROCC vs PLCC: which is "better"?

Neither is strictly better. PLCC rewards both correct magnitude and correct ordering. SROCC only rewards correct ordering. For perceptual quality assessment, SROCC is often preferred because humans care about "which image looks better," not "by exactly how much." Most papers report both.

4. KRCC — Kendall Rank Correlation Coefficient

KRCC (Kendall's $\tau$) counts how many pairs of images are ranked consistently by the model and by humans. It has a direct probability interpretation that SROCC lacks.

4.1 Concordant and Discordant Pairs

For every unordered pair $(i,j)$ with $i < j$:

Concordant: both $x$ and $y$ go in the same direction — $(x_i > x_j$ and $y_i > y_j)$ or $(x_i < x_j$ and $y_i < y_j)$. The model agrees with humans on this pair.
Discordant: opposite directions — $(x_i > x_j$ and $y_i < y_j)$ or vice versa. Model disagrees with humans.
Tied: $x_i = x_j$ or $y_i = y_j$ — excluded from counts.

4.2 Formula and Probability Interpretation

$$\tau = \frac{N_c - N_d}{\dbinom{n}{2}} = \frac{N_c - N_d}{n(n-1)/2}$$

where $N_c$ = number of concordant pairs and $N_d$ = number of discordant pairs. The denominator $\binom{n}{2}$ is the total number of distinct pairs.

Probability Interpretation

$\tau = P(\text{concordant}) - P(\text{discordant})$ for a uniformly randomly chosen pair of images. So $\tau = 0.7$ means: pick any two images at random, and if there are no ties, the model ranks them the same way humans do with probability 85% (since $(1+0.7)/2 = 0.85$).
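Pair counting is simple enough to spell out directly. The following sketch uses six made-up image scores (illustrative values, no ties), counts concordant and discordant pairs by brute force, and compares the result with `scipy.stats.kendalltau`, which computes the tie-corrected $\tau_b$ by default and therefore coincides with the formula above when no ties are present.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical scores for six images (no ties)
pred = np.array([0.31, 0.75, 0.52, 0.90, 0.12, 0.66])
mos = np.array([2.0, 3.9, 3.1, 4.4, 1.2, 4.1])

n_c = n_d = 0
for i, j in combinations(range(len(pred)), 2):
    # Positive product: same direction in both lists; negative: opposite
    direction = np.sign(pred[i] - pred[j]) * np.sign(mos[i] - mos[j])
    if direction > 0:
        n_c += 1
    elif direction < 0:
        n_d += 1

n_pairs = len(pred) * (len(pred) - 1) // 2
tau = (n_c - n_d) / n_pairs
p_concordant = n_c / n_pairs

tau_scipy, _ = stats.kendalltau(pred, mos)
tau_scipy = float(tau_scipy)

print(f"N_c={n_c}  N_d={n_d}  pairs={n_pairs}")
# prints: N_c=14  N_d=1  pairs=15
print(f"tau={tau:.4f}  scipy={tau_scipy:.4f}  P(concordant)={p_concordant:.4f}")
```

With one discordant pair out of 15, $\tau = 13/15$ and the probability of agreement on a random pair is $14/15 = (1+\tau)/2$. The brute-force loop is quadratic in $n$, which is fine for illustration but not for large datasets.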

Why is KRCC usually smaller than SROCC (in magnitude)?

For a bivariate normal population with Pearson correlation $r$, $\tau = \frac{2}{\pi}\arcsin(r)$ while SROCC is $\rho = \frac{6}{\pi}\arcsin(r/2)$, so $|\tau| \leq |\rho|$ for every $r$. This ordering is typical in practice but is not guaranteed for arbitrary samples. KRCC is more conservative: it penalises every discordant pair equally, while SROCC weights large rank differences more heavily.
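The two bivariate-normal formulas can be tabulated directly. In this sketch `r` denotes the underlying Pearson correlation of a hypothetical bivariate normal population, `tau` is Kendall's coefficient and `rho_s` is Spearman's; the grid of `r` values is arbitrary.

```python
import numpy as np

# Pearson correlations of a hypothetical bivariate normal population
r = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 0.99])

tau = (2.0 / np.pi) * np.arcsin(r)           # Kendall's tau
rho_s = (6.0 / np.pi) * np.arcsin(r / 2.0)   # Spearman's rho

for ri, ti, si in zip(r, tau, rho_s):
    print(f"r={ri:.2f}  tau={ti:.3f}  rho_s={si:.3f}  tau/rho_s={ti / si:.3f}")
```

At moderate correlations the ratio is close to 2/3, and it approaches 1 only as $r \to 1$, which is why reported KRCC values sit well below SROCC for the same model.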

Why report KRCC in addition to SROCC?

KRCC has better-behaved sampling properties: its distribution under the null hypothesis approaches normality quickly even for small samples, and its standard error is easier to compute, which makes it more suitable for statistical significance tests. KRCC also has dedicated tie-corrected variants (Kendall's $\tau_b$ and $\tau_c$), which matters when many images share the same MOS.

5. RMSE — Root Mean Square Error

RMSE measures the average magnitude of prediction errors in the same units as MOS. Unlike the correlation metrics, it is not scale-invariant.

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n (\hat{y}_i - y_i)^2}$$

where $\hat{y}_i = f(x_i)$ is the logistic-fitted prediction. RMSE $\in [0, \infty)$. Lower is better. If MOS is on a scale of 1–5, RMSE $= 0.3$ means predictions are off by about $0.3$ MOS points on average (as an RMS, not simple average).

How does RMSE relate to PLCC?

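They capture complementary aspects. A model with PLCC = 1 and RMSE = 0 is perfect. A model can have PLCC close to 1 but still have a large RMSE when the MOS range is wide, because the residual spread left after fitting is then large in absolute terms (e.g., predictions are systematically wrong for a subset of images). Conversely, a model that always predicts the mean MOS has an RMSE equal to the MOS standard deviation, which is small on a dataset with a narrow MOS range, while its PLCC is undefined (zero prediction variance) and conventionally treated as 0.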

Does RMSE penalise large errors more than small ones?

Yes — by squaring errors before averaging, RMSE gives extra weight to large deviations. An error of 2 contributes 4× as much as an error of 1. This makes RMSE sensitive to a few badly-predicted images. Mean Absolute Error (MAE = $\frac{1}{n}\sum|y_i - \hat{y}_i|$) treats all errors proportionally but is less common in IQA evaluation.
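The difference between the two error measures shows up as soon as one residual is much larger than the rest. A minimal sketch, assuming ten hypothetical residuals (fitted prediction minus MOS) of which nine are 0.1 and one is 2.0:

```python
import numpy as np

# Hypothetical residuals: nine small errors and one large one
errors = np.array([0.1] * 9 + [2.0])

mae = float(np.mean(np.abs(errors)))
rmse = float(np.sqrt(np.mean(errors ** 2)))

# Fraction of the total squared error contributed by the single large residual
share_of_largest = float(errors[-1] ** 2 / np.sum(errors ** 2))

print(f"MAE  = {mae:.3f}")
print(f"RMSE = {rmse:.3f}")
print(f"largest error's share of squared error = {share_of_largest:.3f}")
```

RMSE comes out at more than twice the MAE here, and almost all of the squared error is due to the one badly predicted image.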

6. Failure Cases — When Metrics Disagree

The most important reason to report all four metrics is that they can disagree significantly — and each disagreement reveals a specific failure mode of the model.

6.1 Global Positive, Local Negative

The most insidious failure case: a model shows high PLCC/SROCC overall, but within a specific quality range the model's ordering is inverted relative to human perception. For example:

High-distortion images (MOS 1–2): model correctly ranks these low.
Low-distortion images (MOS 4–5): model correctly ranks these high.
Medium-distortion images (MOS 2–4): model is locally wrong — images it ranks higher in this range are actually perceived as lower quality.

PLCC is dominated by the large dynamic range between the extreme clusters, so it stays high. SROCC is partially sensitive but also gets pulled by the extremes. KRCC is most likely to catch this, because it counts every discordant pair equally regardless of where in the quality range the inversion occurs.

Why This Happens in Practice

Many NR-IQA models are trained on a mix of content types (natural scenes, screen content, synthetic distortions). The model learns the dominant trend (heavily distorted = bad) but fails on subtle quality differences in the medium range — which is exactly where perceptual quality discrimination matters most to users.

A model with PLCC=0.93 and SROCC=0.91 can still have negative local correlation in a specific distortion type or quality range. Always evaluate on subsets of the data by content type and distortion level.
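A subset breakdown of this kind can be sketched on constructed data. The three clusters below are hypothetical and built by hand so that the medium range is perfectly inverted; PLCC is computed on raw scores without logistic fitting, and the helper name `report` is introduced only for this illustration.

```python
import numpy as np
from scipy import stats

def report(x, y):
    """Raw PLCC, SROCC and KRCC for one set of paired scores."""
    rho, _ = stats.spearmanr(x, y)
    tau, _ = stats.kendalltau(x, y)
    return {
        "PLCC": float(np.corrcoef(x, y)[0, 1]),
        "SROCC": float(rho),
        "KRCC": float(tau),
    }

# Hypothetical clusters: extremes ranked correctly, medium range inverted
mos_low, pred_low = np.linspace(1.0, 1.9, 30), np.linspace(0.05, 0.25, 30)
mos_mid, pred_mid = np.linspace(2.0, 4.0, 20), np.linspace(0.55, 0.45, 20)
mos_high, pred_high = np.linspace(4.1, 5.0, 30), np.linspace(0.75, 0.95, 30)

mos = np.concatenate([mos_low, mos_mid, mos_high])
pred = np.concatenate([pred_low, pred_mid, pred_high])

overall = report(pred, mos)

# Evaluate the medium-quality subset on its own
mid_mask = (mos > 1.95) & (mos < 4.05)
medium = report(pred[mid_mask], mos[mid_mask])

print("overall:", {k: round(v, 3) for k, v in overall.items()})
print("medium :", {k: round(v, 3) for k, v in medium.items()})
```

The overall PLCC and SROCC stay above 0.95 even though every pair inside the medium range is ordered the wrong way; KRCC drops further, and only the subset evaluation exposes the inversion outright.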

6.2 High PLCC with Low SROCC (Non-Monotonic but Globally Linear)

If predicted scores and MOS have a strong global linear trend but many local rank inversions, PLCC will be high and SROCC will be lower. This typically happens when the model output is bimodal (e.g., separating compressed vs uncompressed well) but noisy within each mode.

6.3 Low RMSE with Low PLCC (Predicting the Mean)

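A degenerate model that always predicts $\hat{y} = \bar{y}$ achieves RMSE equal to the standard deviation of MOS, which can be numerically small when the dataset covers a narrow quality range, while its PLCC is undefined (the predictions have zero variance) and is conventionally reported as 0. This is why RMSE alone is never sufficient.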
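The degenerate case is quick to verify. The sketch below uses five hypothetical MOS values and a constant predictor; `safe_plcc` is a helper introduced here for illustration that returns NaN instead of dividing by a zero variance.

```python
import numpy as np

def safe_plcc(x, y):
    """Pearson r, or NaN when either input has zero variance."""
    if np.std(x) == 0 or np.std(y) == 0:
        return float("nan")
    return float(np.corrcoef(x, y)[0, 1])

mos = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical MOS values
pred = np.full_like(mos, mos.mean())        # always predict the mean MOS

rmse_const = float(np.sqrt(np.mean((pred - mos) ** 2)))
mos_std = float(np.std(mos))                # population standard deviation
plcc_const = safe_plcc(pred, mos)

print(f"RMSE of constant predictor: {rmse_const:.4f}")
print(f"std of MOS:                 {mos_std:.4f}")
print(f"PLCC of constant predictor: {plcc_const}")
```

The RMSE equals the MOS standard deviation exactly, so on a dataset with a narrow MOS range a constant predictor can post a deceptively small RMSE while carrying no ranking information at all.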

6.4 Summary Table

Metric	Range	Measures	Needs fitting?	Robust to outliers?	In IQA papers
PLCC	[-1, 1]	Linear correlation	Yes	Low	Always reported
SROCC	[-1, 1]	Monotonic ordering	No	High	Primary metric
KRCC	[-1, 1]	Pairwise concordance	No	High	Often reported
RMSE	[0, ∞)	Absolute error	Yes	Low	Always reported

iqa_metrics.py (NumPy · SciPy)

import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

# ── Nonlinear logistic regression (VQEG standard 4-parameter fit) ────────────
def logistic_fit(x, b1, b2, b3, b4):
    """4-parameter logistic: maps predicted scores to MOS scale."""
    return (b1 - b2) / (1 + np.exp(-b3 * (x - b4))) + b2

def fit_predictions(predicted: np.ndarray, mos: np.ndarray) -> np.ndarray:
    """Fit a 4-parameter logistic to map predicted → MOS scale."""
    b0 = [np.max(mos), np.min(mos), 1.0, np.mean(predicted)]
    try:
        popt, _ = curve_fit(logistic_fit, predicted, mos, p0=b0, maxfev=10_000)
        return logistic_fit(predicted, *popt)
    except RuntimeError:
        # fallback: linear rescaling
        a = np.polyfit(predicted, mos, 1)
        return np.polyval(a, predicted)

# ── Core metrics ─────────────────────────────────────────────────────────────
def plcc(predicted: np.ndarray, mos: np.ndarray) -> float:
    """Pearson Linear Correlation Coefficient (after logistic fitting)."""
    fitted = fit_predictions(predicted, mos)
    return float(np.corrcoef(fitted, mos)[0, 1])

def srocc(predicted: np.ndarray, mos: np.ndarray) -> float:
    """Spearman Rank-Order Correlation Coefficient (no fitting needed)."""
    rho, _ = stats.spearmanr(predicted, mos)
    return float(rho)

def krcc(predicted: np.ndarray, mos: np.ndarray) -> float:
    """Kendall Rank Correlation Coefficient (no fitting needed)."""
    tau, _ = stats.kendalltau(predicted, mos)
    return float(tau)

def rmse(predicted: np.ndarray, mos: np.ndarray) -> float:
    """Root Mean Square Error (after logistic fitting)."""
    fitted = fit_predictions(predicted, mos)
    return float(np.sqrt(np.mean((fitted - mos) ** 2)))

# ── Evaluate all metrics at once ─────────────────────────────────────────────
def evaluate(predicted: np.ndarray, mos: np.ndarray) -> dict:
    return {
        "PLCC":  plcc(predicted, mos),
        "SROCC": srocc(predicted, mos),
        "KRCC":  krcc(predicted, mos),
        "RMSE":  rmse(predicted, mos),
    }

# ── Example: simulate a model evaluation ────────────────────────────────────
rng = np.random.default_rng(42)
n   = 200
mos_scores    = rng.uniform(1, 5, n)                        # human MOS  [1,5]
pred_scores   = 0.85 * mos_scores + rng.normal(0, 0.4, n)  # noisy predictions

metrics = evaluate(pred_scores, mos_scores)
for k, v in metrics.items():
    print(f"{k:6s}: {v:.4f}")
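# Exact values depend on the random draw and library versions, so no printed
# numbers are reproduced here. For this simulation the population Pearson
# correlation is about 0.93, and at a converged least-squares fit
# RMSE = std(mos) * sqrt(1 - PLCC**2), i.e. roughly 0.44 for that correlation.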

Summary

PLCC = Pearson r = cosine similarity of mean-centered score vectors. Measures linear agreement; requires logistic fitting; sensitive to outliers.

SROCC = Pearson r on ranks. Measures monotonic ordering; no fitting required; robust to outliers; the standard IQA primary metric.

KRCC = (concordant − discordant pairs) / total pairs. Direct probability interpretation; most conservative; best for significance testing.

RMSE = RMS error in MOS units after fitting. Measures absolute accuracy; reports the residual error that correlation coefficients express only relative to the MOS spread; always report alongside PLCC.

No single metric tells the full story. A responsible IQA paper always reports all four — and ideally also reports per-content-type and per-distortion-level breakdowns to catch local inversions.