Translation Prompt Evaluation

Generated: 2026-06-10 11:07:53 UTC

This page compares each AI translation prompt version against the best available human translation of the same lemma. The length regression takes the human word count as x and the AI word count as y; a slope near 1 with an intercept near 0 means the AI translations match the human translations in length. BLEU, n-gram overlap, ROUGE-L, and chrF are exact-overlap metrics, so they understate quality when a good translation uses different wording.
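
As a rough illustration of the length regression described above, the following Python sketch fits AI word count against human word count by ordinary least squares and derives the slope, intercept, R^2, |slope - 1|, and mean absolute residual. The function name `length_regression` and the sample word counts are hypothetical. The page does not state how words are tokenized or which fitting routine is used, so this is one plausible reading, not the page's actual implementation.

```python
from statistics import mean


def length_regression(human_counts, ai_counts):
    """Least-squares fit of AI word count (y) on human word count (x).

    Hypothetical helper: assumes one (human, AI) word-count pair per lemma.
    """
    if len(human_counts) != len(ai_counts) or len(human_counts) < 2:
        raise ValueError("need at least two paired word counts")

    x_bar, y_bar = mean(human_counts), mean(ai_counts)
    sxx = sum((x - x_bar) ** 2 for x in human_counts)
    if sxx == 0:
        raise ValueError("human word counts must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(human_counts, ai_counts))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual = observed AI length minus the length predicted by the fit.
    residuals = [y - (intercept + slope * x)
                 for x, y in zip(human_counts, ai_counts)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((y - y_bar) ** 2 for y in ai_counts)
    # R^2 is undefined when every AI count is the same.
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else float("nan")

    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared,
        "abs_slope_minus_1": abs(slope - 1.0),
        "mean_abs_residual": mean(abs(r) for r in residuals),
    }


# Made-up counts: the AI output is always two words longer than the human one.
fit = length_regression([10, 20, 30, 40], [12, 22, 32, 42])
print(fit)
```

In this made-up case the fit has slope 1 and intercept 2: the AI tracks human length proportionally but adds a constant two words. The Slope p column would additionally need a significance test on the slope; `scipy.stats.linregress` reports such a p-value (against a zero slope), though the page does not say which test it uses.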

Inputs: AI runs that have status completed or approved and non-empty translation text, compared against any non-empty human translation regardless of its status. Full translation texts are used for metrics but are not printed on these pages.
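
The input rule above can be sketched as a small filter. The `Translation` record, its field names, and the `"ai"`/`"human"` source labels are assumptions made for illustration, as is treating whitespace-only text as empty; only the status names `completed` and `approved` come from this page. How the "best available" human translation is chosen is not specified here, so the sketch stops at collecting the eligible candidates per lemma.

```python
from collections import defaultdict
from dataclasses import dataclass

AI_STATUSES = {"completed", "approved"}  # statuses named on the page


@dataclass(frozen=True)
class Translation:
    """Hypothetical record shape; the real schema is not shown on the page."""
    lemma: str
    source: str  # assumed labels: "ai" or "human"
    status: str
    text: str


def is_eligible(t: Translation) -> bool:
    # Assumption: whitespace-only text counts as empty.
    if not t.text.strip():
        return False
    if t.source == "ai":
        return t.status in AI_STATUSES
    # Human translations qualify in any status.
    return t.source == "human"


def candidates_by_lemma(translations):
    """Map each lemma to (eligible AI runs, eligible human translations).

    Lemmas missing either side are dropped, since no comparison is possible.
    """
    ai, human = defaultdict(list), defaultdict(list)
    for t in translations:
        if is_eligible(t):
            (ai if t.source == "ai" else human)[t.lemma].append(t)
    return {lemma: (ai[lemma], human[lemma]) for lemma in ai if lemma in human}


rows = [
    Translation("logos", "ai", "completed", "word"),
    Translation("logos", "human", "draft", "word, speech"),
    Translation("ethos", "ai", "failed", "character"),   # wrong AI status
    Translation("ethos", "human", "approved", "character"),
    Translation("pathos", "ai", "approved", "   "),      # empty text
    Translation("pathos", "human", "approved", "suffering"),
]
pairs = candidates_by_lemma(rows)
print(sorted(pairs))
```

Only `logos` survives in this made-up data: its human translation is a draft, which is still accepted, while the other two lemmas lack an eligible AI run.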

[Figure: Prompt evaluation trend plot]
Prompt               First translation  Pairs  Lemmas  Slope  Intercept  R^2    Slope p   |slope - 1|  BLEU   Trigram F1  ROUGE-L F1  chrF   Mean abs residual
legacy_scholarly v1  2026-02-14         9      9       1.085  2.690      0.971  1.29e-06  0.085        31.0%  26.2%       60.3%       63.1%  3.191
legacy_scholarly v3  2026-05-07         14     14      1.045  -1.117     0.973  8.72e-11  0.045        59.6%  53.4%       79.5%       82.2%  3.302
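
To illustrate why the overlap columns are conservative, here is a minimal sketch of a trigram F1 score. The function names, the lowercased whitespace tokenization, and the example sentences are assumptions made for illustration; the exact tokenizer and scoring variant behind the Trigram F1 column are not documented on this page.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of contiguous n-grams; empty if there are fewer than n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_f1(candidate, reference, n=3):
    """F1 over clipped n-gram matches (assumed lowercase whitespace tokens)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example sentences, not taken from the evaluated translations.
human = "the old king ruled the northern lands"
same = "the old king ruled the northern lands"
paraphrase = "the aged monarch governed the lands of the north"

print(ngram_f1(same, human))
print(ngram_f1(paraphrase, human))
```

The identical sentence scores 1.0, while the paraphrase shares no trigram with the reference and scores 0.0 even though it carries the same meaning. That is the sense in which a low overlap score does not by itself show a poor translation.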

Prompt Versions Without Evaluation Metrics

These prompt versions have AI translation runs, but they do not currently have enough tokenizable human-translation overlap to calculate the comparison metrics above.

Prompt               First translation  AI runs  Lemmas  Reason
legacy_scholarly v2  2026-05-24         350      209     no non-empty human translation in any status for its translated lemmas

Downloadable Tables