Generated: 2026-06-16 03:06:14 UTC
This page compares each AI translation prompt version against the best available human translation for the same lemma, using the automated MT metric set from Zainaldin et al. 2026: BLEU-4, chrF++, METEOR, ROUGE-L, BERTScore, COMET, and BLEURT. Length regression and residual tables remain as local diagnostics for translation-length drift.
Inputs: one representative AI run per lemma and prompt version, chosen from runs with status completed or approved and non-empty translation text, preferring a run on the current preferred source text. On the human side, only approved reviewed/final translations are used. Full translation texts are used for metrics but are not printed on these pages.
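
The selection rule above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Run` record, its field names, and the tie-break (first eligible run wins unless a later one uses the preferred source text) are assumptions made for the example.

```python
from dataclasses import dataclass

ELIGIBLE_STATUSES = {"completed", "approved"}


@dataclass(frozen=True)
class Run:
    # Hypothetical record shape; the real schema is not shown on this page.
    lemma: str
    prompt_version: str
    status: str
    translation: str
    uses_preferred_source: bool


def pick_representative_runs(runs):
    """Return one eligible run per (lemma, prompt_version) pair."""
    chosen = {}
    for run in runs:
        # Drop runs with the wrong status or an empty translation.
        if run.status not in ELIGIBLE_STATUSES or not run.translation.strip():
            continue
        key = (run.lemma, run.prompt_version)
        current = chosen.get(key)
        # Keep the first eligible run, but let a preferred-source run
        # replace one that used a non-preferred source text (assumed tie-break).
        if current is None or (
            run.uses_preferred_source and not current.uses_preferred_source
        ):
            chosen[key] = run
    return chosen
```

A lemma whose only runs are failed or empty simply has no entry in the result, which is one way the row counts could differ between prompt versions.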
The lexical metrics run locally. The three model-based metrics need extra packages: BERTScore needs the bert-score package, COMET needs unbabel-comet, and BLEURT needs the BLEURT package plus a local checkpoint path in BLEURT_CHECKPOINT.
| Metric | Status for this run |
|---|---|
| BLEU-4 | SacreBLEU sentence BLEU-4 |
| chrF++ | SacreBLEU chrF++ with word_order=2 |
| METEOR | NLTK METEOR with WordNet synonyms |
| ROUGE-L | rouge-score ROUGE-L with stemming |
| BERTScore | sidecar bert-score F1 |
| COMET | sidecar Unbabel/wmt22-comet-da |
| BLEURT | sidecar BLEURT checkpoint /home/stephanos/metric-envs/bleurt/BLEURT-20 |
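
A status column like the one above could be derived from a small availability check. The sketch below assumes the usual import names for the three optional packages (`bert_score`, `comet`, `bleurt`) and reads `BLEURT_CHECKPOINT` from the environment; the function names and status strings are hypothetical.

```python
import importlib.util
import os

# Assumed import names: bert-score -> bert_score, unbabel-comet -> comet,
# BLEURT -> bleurt.
SIDECAR_MODULES = {"BERTScore": "bert_score", "COMET": "comet", "BLEURT": "bleurt"}


def module_installed(name):
    """True if a top-level module with this name can be located."""
    return importlib.util.find_spec(name) is not None


def sidecar_status(env=None, has_module=module_installed):
    """Report which model-based metrics look runnable in this environment."""
    env = os.environ if env is None else env
    status = {}
    for metric, module in SIDECAR_MODULES.items():
        if not has_module(module):
            status[metric] = f"unavailable: module {module} not importable"
        elif metric == "BLEURT" and not env.get("BLEURT_CHECKPOINT"):
            # BLEURT additionally needs a local checkpoint path.
            status[metric] = "unavailable: BLEURT_CHECKPOINT not set"
        else:
            status[metric] = "available"
    return status
```

Passing `env` and `has_module` explicitly keeps the check testable without installing any of the packages.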
Passage length is measured as the source Greek token count when source text is available, falling back to human translation word count only when a source passage has no Greek tokens. Each row regresses one metric against passage length for one prompt version.
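
The length measure and its fallback can be sketched as below. The tokenizer is an assumption: tokens are split on whitespace, and a token counts as Greek if it contains at least one Greek letter according to Unicode character names. The real tokenizer may differ.

```python
import unicodedata


def is_greek_letter(ch):
    """True for alphabetic characters whose Unicode name starts with GREEK."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("GREEK")


def greek_token_count(text):
    """Count whitespace-separated tokens containing at least one Greek letter."""
    return sum(
        1 for token in text.split() if any(is_greek_letter(ch) for ch in token)
    )


def passage_length(source_text, human_translation):
    """Return (length, basis), preferring the Greek token count of the source."""
    greek_tokens = greek_token_count(source_text or "")
    if greek_tokens > 0:
        return greek_tokens, "greek_tokens"
    # Fallback: the source is missing or has no Greek tokens.
    return len(human_translation.split()), "human_words"
```

Returning the basis alongside the count makes it possible to see how many rows in a regression used the fallback.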
Across 33 evaluable metric/prompt regressions, every correlation is negative: 0 positive, 33 negative, and 0 flat. Using p < 0.05, 0 are significantly positive and 32 are significantly negative; the one exception is chrF++ for legacy_scholarly v1 (p = 0.0543).
| Prompt | Metric | Rows | Pattern | Pearson r | R^2 | P-value | Slope | Status |
|---|---|---|---|---|---|---|---|---|
| legacy_scholarly v1 | BLEU-4 | 92 | significant negative | -0.249 | 0.062 | 0.0169 | -0.00120 | ok |
| legacy_scholarly v1 | chrF++ | 92 | negative | -0.201 | 0.041 | 0.0543 | -0.00064 | ok |
| legacy_scholarly v1 | METEOR | 92 | significant negative | -0.224 | 0.050 | 0.0317 | -0.00097 | ok |
| legacy_scholarly v1 | ROUGE-L | 92 | significant negative | -0.251 | 0.063 | 0.0159 | -0.00097 | ok |
| legacy_scholarly v1 | BERTScore | 92 | significant negative | -0.392 | 0.153 | 0.0001 | -0.00028 | ok |
| legacy_scholarly v1 | COMET | 92 | significant negative | -0.267 | 0.071 | 0.0101 | -0.00050 | ok |
| legacy_scholarly v1 | BLEURT | 92 | significant negative | -0.655 | 0.429 | 1.43e-12 | -0.00213 | ok |
| legacy_scholarly v1 | Trigram precision | 92 | significant negative | -0.206 | 0.042 | 0.0494 | -0.00108 | ok |
| legacy_scholarly v1 | Trigram recall | 92 | significant negative | -0.217 | 0.047 | 0.0375 | -0.00116 | ok |
| legacy_scholarly v1 | Trigram F1 | 92 | significant negative | -0.211 | 0.044 | 0.0437 | -0.00111 | ok |
| legacy_scholarly v1 | Trigram Jaccard | 92 | significant negative | -0.217 | 0.047 | 0.0374 | -0.00089 | ok |
| legacy_scholarly v2 | BLEU-4 | 89 | significant negative | -0.499 | 0.249 | 6.57e-07 | -0.00299 | ok |
| legacy_scholarly v2 | chrF++ | 89 | significant negative | -0.543 | 0.295 | 3.79e-08 | -0.00193 | ok |
| legacy_scholarly v2 | METEOR | 89 | significant negative | -0.548 | 0.300 | 2.71e-08 | -0.00217 | ok |
| legacy_scholarly v2 | ROUGE-L | 89 | significant negative | -0.492 | 0.242 | 1.00e-06 | -0.00179 | ok |
| legacy_scholarly v2 | BERTScore | 89 | significant negative | -0.670 | 0.449 | 6.72e-13 | -0.00045 | ok |
| legacy_scholarly v2 | COMET | 89 | significant negative | -0.488 | 0.238 | 1.24e-06 | -0.00084 | ok |
| legacy_scholarly v2 | BLEURT | 89 | significant negative | -0.757 | 0.573 | 9.09e-18 | -0.00265 | ok |
| legacy_scholarly v2 | Trigram precision | 89 | significant negative | -0.394 | 0.155 | 0.0001 | -0.00276 | ok |
| legacy_scholarly v2 | Trigram recall | 89 | significant negative | -0.414 | 0.172 | 5.45e-05 | -0.00278 | ok |
| legacy_scholarly v2 | Trigram F1 | 89 | significant negative | -0.405 | 0.164 | 8.24e-05 | -0.00276 | ok |
| legacy_scholarly v2 | Trigram Jaccard | 89 | significant negative | -0.403 | 0.162 | 9.04e-05 | -0.00285 | ok |
| legacy_scholarly v3 | BLEU-4 | 90 | significant negative | -0.375 | 0.141 | 0.0003 | -0.00257 | ok |
| legacy_scholarly v3 | chrF++ | 90 | significant negative | -0.481 | 0.231 | 1.64e-06 | -0.00185 | ok |
| legacy_scholarly v3 | METEOR | 90 | significant negative | -0.445 | 0.198 | 1.08e-05 | -0.00180 | ok |
| legacy_scholarly v3 | ROUGE-L | 90 | significant negative | -0.417 | 0.174 | 4.38e-05 | -0.00167 | ok |
| legacy_scholarly v3 | BERTScore | 90 | significant negative | -0.648 | 0.419 | 5.29e-12 | -0.00048 | ok |
| legacy_scholarly v3 | COMET | 90 | significant negative | -0.508 | 0.258 | 3.26e-07 | -0.00086 | ok |
| legacy_scholarly v3 | BLEURT | 90 | significant negative | -0.734 | 0.539 | 1.86e-16 | -0.00256 | ok |
| legacy_scholarly v3 | Trigram precision | 90 | significant negative | -0.289 | 0.084 | 0.0057 | -0.00232 | ok |
| legacy_scholarly v3 | Trigram recall | 90 | significant negative | -0.300 | 0.090 | 0.0040 | -0.00235 | ok |
| legacy_scholarly v3 | Trigram F1 | 90 | significant negative | -0.295 | 0.087 | 0.0047 | -0.00233 | ok |
| legacy_scholarly v3 | Trigram Jaccard | 90 | significant negative | -0.322 | 0.103 | 0.0020 | -0.00267 | ok |
Zainaldin et al.'s Galen translation is represented here by its reported mean passage length of 220.5 words. The Stephanos values below are raw ordinary-least-squares predictions from the metric-vs-passage-length regressions above; they are not clamped to metric bounds, so a prediction can fall below 0%.
| Prompt | Synthetic passage words | Observed source word range | Extrapolated? | BLEU-4 | chrF++ | METEOR | ROUGE-L | BERTScore | COMET | BLEURT |
|---|---|---|---|---|---|---|---|---|---|---|
| legacy_scholarly v1 | 220.5 | 5.000-181.000 | yes | 11.0% | 47.6% | 47.0% | 47.1% | 88.2% | 64.9% | 26.9% |
| legacy_scholarly v2 | 220.5 | 5.000-181.000 | yes | -0.4% | 39.4% | 39.5% | 46.8% | 88.1% | 64.8% | 27.2% |
| legacy_scholarly v3 | 220.5 | 5.000-181.000 | yes | 6.5% | 40.9% | 46.5% | 48.9% | 87.7% | 64.6% | 29.5% |
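
The predictions above are plain line evaluations at 220.5 words. The sketch below shows the arithmetic and the extrapolation flag. The slope is the reported legacy_scholarly v2 BLEU-4 slope, but the intercept of 0.655 is an illustrative value chosen for this example; this page does not report intercepts for the metric regressions.

```python
def predict_at_length(slope, intercept, length, observed_min, observed_max):
    """Evaluate the fitted line at `length` without clamping to [0, 1]."""
    predicted = intercept + slope * length
    # A prediction is an extrapolation when the length lies outside the
    # range of passage lengths seen in the regression data.
    extrapolated = not (observed_min <= length <= observed_max)
    return predicted, extrapolated


# Slope from the regression table; intercept is hypothetical.
predicted, extrapolated = predict_at_length(
    slope=-0.00299, intercept=0.655, length=220.5,
    observed_min=5.0, observed_max=181.0,
)
print(f"{predicted:.1%} extrapolated={extrapolated}")
```

Because 220.5 lies beyond the observed maximum of 181, every row is an extrapolation, and a steep enough negative slope can push the unclamped value below zero.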
Table 1 of Zainaldin et al. 2026 reports these values as mean scores multiplied by 100; the table below displays the same scale as percentages.
| Text | Model | BLEU-4 | chrF++ | METEOR | ROUGE-L | BERTScore | COMET | BLEURT |
|---|---|---|---|---|---|---|---|---|
| Mix. | ChatGPT | 31.4% | 53.4% | 46.4% | 50.9% | 91.0% | 79.9% | 49.8% |
| Mix. | Claude | 34.2% | 55.4% | 48.5% | 55.3% | 91.6% | 79.8% | 50.4% |
| Mix. | Gemini | 34.2% | 57.0% | 50.0% | 56.0% | 91.5% | 80.7% | 51.3% |
| Mix. | Aggregate | 33.3% | 55.3% | 48.3% | 54.1% | 91.4% | 80.1% | 50.5% |
| Comp. | ChatGPT | 15.7% | 47.4% | 40.1% | 45.7% | 89.1% | 75.1% | 42.6% |
| Comp. | Claude | 16.7% | 49.4% | 42.9% | 47.8% | 89.7% | 76.5% | 46.2% |
| Comp. | Gemini | 19.0% | 51.2% | 44.4% | 47.8% | 89.9% | 77.3% | 45.8% |
| Comp. | Aggregate | 17.1% | 49.3% | 42.5% | 47.1% | 89.5% | 76.3% | 44.9% |
| Prompt | First translation | Pairs | Lemmas | Slope | Intercept | R^2 | Slope p | abs(slope - 1) | BLEU-4 | chrF++ | METEOR | ROUGE-L | BERTScore | COMET | BLEURT | Trigram precision | Trigram recall | Trigram F1 | Trigram Jaccard | Mean abs residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| legacy_scholarly v1 | 2026-02-14 | 92 | 92 | 1.032 | 1.756 | 0.987 | 5.48e-87 | 0.032 | 33.7% | 59.7% | 65.4% | 65.5% | 93.4% | 74.4% | 67.2% | 26.4% | 28.3% | 27.3% | 15.8% | 3.418 |
| legacy_scholarly v2 | 2026-05-24 | 89 | 89 | 0.927 | 2.251 | 0.988 | 9.07e-86 | 0.073 | 56.0% | 75.8% | 80.6% | 80.7% | 96.7% | 80.7% | 77.3% | 47.4% | 46.2% | 46.8% | 30.5% | 2.665 |
| legacy_scholarly v3 | 2026-05-07 | 90 | 90 | 0.933 | 2.240 | 0.984 | 7.70e-81 | 0.067 | 55.1% | 75.9% | 80.5% | 80.5% | 96.7% | 80.8% | 77.9% | 48.1% | 47.2% | 47.6% | 31.3% | 3.169 |