Generated: 2026-06-10 11:07:53 UTC
This page compares each AI translation prompt version against the best available human translation for the same lemma. Length regression fits AI word count (y) against human word count (x); a slope near 1 and an intercept near 0 mean the AI translations match the human translations in length. BLEU, n-gram overlap, ROUGE-L, and chrF are exact-overlap metrics, so they can understate quality when a good translation uses different wording from the human reference.
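The length regression described above can be sketched as an ordinary least-squares fit. This is a minimal illustration, not the code that generated this page: the function name `length_regression`, the result keys, and the sample word counts are all hypothetical. It also assumes the mean absolute residual is measured against the fitted line, which this page does not state. The slope p-value is omitted.

```python
from math import fsum


def length_regression(human_counts, ai_counts):
    """Fit ai = slope * human + intercept by ordinary least squares.

    Hypothetical helper for illustration only.
    """
    n = len(human_counts)
    if n != len(ai_counts) or n < 2:
        raise ValueError("need at least two (human, AI) pairs of equal length")

    mean_x = fsum(human_counts) / n
    mean_y = fsum(ai_counts) / n

    # Sum of squares of x and cross-products of x and y around their means.
    sxx = fsum((x - mean_x) ** 2 for x in human_counts)
    if sxx == 0:
        raise ValueError("human word counts are all identical; slope undefined")
    sxy = fsum((x - mean_x) * (y - mean_y)
               for x, y in zip(human_counts, ai_counts))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residuals are measured against the fitted line (an assumption).
    residuals = [y - (slope * x + intercept)
                 for x, y in zip(human_counts, ai_counts)]
    ss_res = fsum(r * r for r in residuals)
    ss_tot = fsum((y - mean_y) ** 2 for y in ai_counts)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else float("nan")

    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared,
        "abs_slope_minus_1": abs(slope - 1.0),
        "mean_abs_residual": fsum(abs(r) for r in residuals) / n,
    }


# Made-up word counts, not taken from the runs summarized on this page.
human = [10, 20, 30, 40]
ai = [12, 21, 33, 42]
fit = length_regression(human, ai)
print({name: round(value, 3) for name, value in fit.items()})
```

A slope slightly above 1 with a small positive intercept, as in this toy data, would indicate AI translations that run a little longer than the human ones across the whole length range.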
Inputs: AI runs with status completed or approved, non-empty translation text, and any non-empty human translation, regardless of status. Full translation texts are used for metrics but are not printed on these pages.
| Prompt | First translation | Pairs | Lemmas | Slope | Intercept | R^2 | Slope p | \|slope - 1\| | BLEU | Trigram F1 | ROUGE-L F1 | chrF | Mean abs residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| legacy_scholarly v1 | 2026-02-14 | 9 | 9 | 1.085 | 2.690 | 0.971 | 1.29e-06 | 0.085 | 31.0% | 26.2% | 60.3% | 63.1% | 3.191 |
| legacy_scholarly v3 | 2026-05-07 | 14 | 14 | 1.045 | -1.117 | 0.973 | 8.72e-11 | 0.045 | 59.6% | 53.4% | 79.5% | 82.2% | 3.302 |
These prompt versions have AI translation runs but do not currently have enough tokenizable human-translation overlap to calculate the comparison metrics above.
| Prompt | First translation | AI runs | Lemmas | Reason |
|---|---|---|---|---|
| legacy_scholarly v2 | 2026-05-24 | 350 | 209 | no non-empty human translation in any status for its translated lemmas |