Translation Prompt Evaluation

Generated: 2026-06-16 03:06:14 UTC

This page compares each AI translation prompt version against the best available human translation for the same lemma, using the automated MT metric set from Zainaldin et al. 2026: BLEU-4, chrF++, METEOR, ROUGE-L, BERTScore, COMET, and BLEURT. Length regression and residual tables remain as local diagnostics for translation-length drift.

Inputs: one representative AI run per lemma and prompt version, preferring runs on the current preferred source text. AI runs must have status completed or approved and non-empty translation text; human references are limited to approved reviewed or final translations. Full translation texts are used to compute the metrics but are not printed on these pages.
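The selection rule above can be sketched in Python as follows. This is a minimal illustration, not the page's actual pipeline: the Run record, its field names, and pick_representatives are hypothetical, and the tie-breaking between equally eligible runs (first one seen wins) is an assumption.

```python
from dataclasses import dataclass

# Hypothetical record for one AI translation run; field names are assumptions.
@dataclass
class Run:
    lemma_id: int
    prompt_version: str
    status: str
    translation: str
    uses_preferred_source: bool

ELIGIBLE_STATUSES = {"completed", "approved"}

def pick_representatives(runs):
    """Keep one eligible run per (lemma, prompt version).

    A run on the preferred source text replaces an earlier pick that is not;
    otherwise the first eligible run seen is kept.
    """
    chosen = {}
    for run in runs:
        # Skip runs with the wrong status or an empty translation.
        if run.status not in ELIGIBLE_STATUSES or not run.translation.strip():
            continue
        key = (run.lemma_id, run.prompt_version)
        current = chosen.get(key)
        if current is None or (
            run.uses_preferred_source and not current.uses_preferred_source
        ):
            chosen[key] = run
    return chosen
```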

Metric Engines

The lexical metrics (BLEU-4, chrF++, METEOR, ROUGE-L) run locally. The neural metrics have extra requirements: BERTScore needs the bert-score package, COMET needs unbabel-comet, and BLEURT needs the BLEURT package plus a local checkpoint path in BLEURT_CHECKPOINT.

Metric Status for this run
BLEU-4 SacreBLEU sentence BLEU-4
chrF++ SacreBLEU chrF++ with word_order=2
METEOR NLTK METEOR with WordNet synonyms
ROUGE-L rouge-score ROUGE-L with stemming
BERTScore sidecar bert-score F1
COMET sidecar Unbabel/wmt22-comet-da
BLEURT sidecar BLEURT checkpoint /home/stephanos/metric-envs/bleurt/BLEURT-20

Figure: R^2 by prompt version, with best-fit line.
Figure: Mean BLEU-4 by prompt version, with best-fit line.
Figure: Length slope by prompt version, with best-fit line.

Metric vs Passage Length Patterns

Passage length is measured as the source Greek token count when source text is available, falling back to human translation word count only when a source passage has no Greek tokens. Each row regresses one metric against passage length for one prompt version.
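The length measure described above might look like the following sketch. The tokenization is an assumption: it counts runs of characters from the Greek and Greek Extended Unicode blocks and splits the human translation on whitespace, which may differ from the tokenizer actually used. The function name passage_length is hypothetical.

```python
import re
import unicodedata

# Greek and Coptic (U+0370-U+03FF) plus Greek Extended (U+1F00-U+1FFF),
# so that precomposed polytonic letters stay inside one token.
GREEK_TOKEN = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]+")

def passage_length(source_text, human_translation):
    """Return the Greek token count, or the human word count as a fallback."""
    # NFC composes base letters with combining accents where possible,
    # so decomposed input is not split at every diacritic.
    normalized = unicodedata.normalize("NFC", source_text or "")
    greek_tokens = GREEK_TOKEN.findall(normalized)
    if greek_tokens:
        return len(greek_tokens)
    # Fallback only when the source passage has no Greek tokens at all.
    return len(human_translation.split())
```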

Evaluable regressions: 33
Positive: 0
Negative: 33
Significant positive: 0
Significant negative: 32

Across 33 evaluable metric/prompt regressions, every correlation is negative: 0 positive, 33 negative, and 0 flat. At p < 0.05, none is significantly positive and 32 are significantly negative; the one exception is chrF++ for legacy_scholarly v1 (p = 0.0543).
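The pattern labels and counts can be reproduced in outline as below. The slopes and p-values are three rows taken from the regression table on this page; the classify helper is hypothetical, and treating a slope of exactly zero as "flat" is an assumption, since the page does not define that label.

```python
from collections import Counter

ALPHA = 0.05  # significance threshold stated on this page

def classify(slope, p_value, alpha=ALPHA):
    """Label a regression by slope direction and significance."""
    if slope == 0:
        return "flat"  # assumption: the page's definition of flat is not given
    direction = "positive" if slope > 0 else "negative"
    return f"significant {direction}" if p_value < alpha else direction

# (metric, slope, p-value) for legacy_scholarly v1, copied from the table.
rows = [
    ("BLEU-4", -0.00120, 0.0169),
    ("chrF++", -0.00064, 0.0543),
    ("METEOR", -0.00097, 0.0317),
]

tally = Counter(classify(slope, p) for _, slope, p in rows)
```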

Prompt Metric Rows Pattern Pearson r R^2 P-value Slope Status
legacy_scholarly v1 BLEU-4 92 significant negative -0.249 0.062 0.0169 -0.00120 ok
legacy_scholarly v1 chrF++ 92 negative -0.201 0.041 0.0543 -0.00064 ok
legacy_scholarly v1 METEOR 92 significant negative -0.224 0.050 0.0317 -0.00097 ok
legacy_scholarly v1 ROUGE-L 92 significant negative -0.251 0.063 0.0159 -0.00097 ok
legacy_scholarly v1 BERTScore 92 significant negative -0.392 0.153 0.0001 -0.00028 ok
legacy_scholarly v1 COMET 92 significant negative -0.267 0.071 0.0101 -0.00050 ok
legacy_scholarly v1 BLEURT 92 significant negative -0.655 0.429 1.43e-12 -0.00213 ok
legacy_scholarly v1 Trigram precision 92 significant negative -0.206 0.042 0.0494 -0.00108 ok
legacy_scholarly v1 Trigram recall 92 significant negative -0.217 0.047 0.0375 -0.00116 ok
legacy_scholarly v1 Trigram F1 92 significant negative -0.211 0.044 0.0437 -0.00111 ok
legacy_scholarly v1 Trigram Jaccard 92 significant negative -0.217 0.047 0.0374 -0.00089 ok
legacy_scholarly v2 BLEU-4 89 significant negative -0.499 0.249 6.57e-07 -0.00299 ok
legacy_scholarly v2 chrF++ 89 significant negative -0.543 0.295 3.79e-08 -0.00193 ok
legacy_scholarly v2 METEOR 89 significant negative -0.548 0.300 2.71e-08 -0.00217 ok
legacy_scholarly v2 ROUGE-L 89 significant negative -0.492 0.242 1.00e-06 -0.00179 ok
legacy_scholarly v2 BERTScore 89 significant negative -0.670 0.449 6.72e-13 -0.00045 ok
legacy_scholarly v2 COMET 89 significant negative -0.488 0.238 1.24e-06 -0.00084 ok
legacy_scholarly v2 BLEURT 89 significant negative -0.757 0.573 9.09e-18 -0.00265 ok
legacy_scholarly v2 Trigram precision 89 significant negative -0.394 0.155 0.0001 -0.00276 ok
legacy_scholarly v2 Trigram recall 89 significant negative -0.414 0.172 5.45e-05 -0.00278 ok
legacy_scholarly v2 Trigram F1 89 significant negative -0.405 0.164 8.24e-05 -0.00276 ok
legacy_scholarly v2 Trigram Jaccard 89 significant negative -0.403 0.162 9.04e-05 -0.00285 ok
legacy_scholarly v3 BLEU-4 90 significant negative -0.375 0.141 0.0003 -0.00257 ok
legacy_scholarly v3 chrF++ 90 significant negative -0.481 0.231 1.64e-06 -0.00185 ok
legacy_scholarly v3 METEOR 90 significant negative -0.445 0.198 1.08e-05 -0.00180 ok
legacy_scholarly v3 ROUGE-L 90 significant negative -0.417 0.174 4.38e-05 -0.00167 ok
legacy_scholarly v3 BERTScore 90 significant negative -0.648 0.419 5.29e-12 -0.00048 ok
legacy_scholarly v3 COMET 90 significant negative -0.508 0.258 3.26e-07 -0.00086 ok
legacy_scholarly v3 BLEURT 90 significant negative -0.734 0.539 1.86e-16 -0.00256 ok
legacy_scholarly v3 Trigram precision 90 significant negative -0.289 0.084 0.0057 -0.00232 ok
legacy_scholarly v3 Trigram recall 90 significant negative -0.300 0.090 0.0040 -0.00235 ok
legacy_scholarly v3 Trigram F1 90 significant negative -0.295 0.087 0.0047 -0.00233 ok
legacy_scholarly v3 Trigram Jaccard 90 significant negative -0.322 0.103 0.0020 -0.00267 ok
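Each row of the table above reports a slope, Pearson r, and R^2 for one metric regressed on passage length. A minimal ordinary-least-squares sketch of those three quantities follows; the regress function and the toy data in the usage lines are illustrative assumptions, not values from this run. The p-value is left out because it needs a t distribution, which the standard library does not provide; scipy.stats.linregress is one option that returns it.

```python
import math

def regress(lengths, scores):
    """Simple OLS of scores on lengths: slope, intercept, Pearson r, R^2."""
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    syy = sum((y - mean_y) ** 2 for y in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    # With a single predictor, R^2 is the square of Pearson r.
    return {"slope": slope, "intercept": intercept, "r": r, "r2": r * r}

# Toy data (hypothetical): scores fall as passages get longer.
fit = regress([10, 20, 30, 40], [0.9, 0.8, 0.7, 0.6])
```

The identity R^2 = r^2 matches the table: for legacy_scholarly v1 BLEU-4, (-0.249)^2 rounds to 0.062.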

Synthetic comparison to the Zainaldin et al. Galen translation

The Galen translation of Zainaldin et al. is represented here by its reported mean passage length of 220.5 words. The Stephanos values below are raw ordinary-least-squares predictions from the metric-vs-passage-length regressions above; they are not clamped to metric bounds.
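The prediction step amounts to evaluating the fitted line at 220.5 and reporting whatever comes out. In the sketch below the slope is the legacy_scholarly v2 BLEU-4 slope from the regression table and the observed range of 5 to 181 is the one listed in the synthetic table, but the intercept is a made-up illustrative value, because this page does not print the intercepts of the metric regressions. The function names are hypothetical.

```python
def predict_unclamped(slope, intercept, length):
    """Evaluate the fitted line without clamping to the metric's [0, 1] range."""
    return intercept + slope * length

def is_extrapolated(length, observed_min, observed_max):
    """True when the target length lies outside the fitted data range."""
    return not (observed_min <= length <= observed_max)

TARGET_LENGTH = 220.5             # mean passage length reported for the Galen translation
SLOPE_V2_BLEU4 = -0.00299         # from the regression table on this page
ILLUSTRATIVE_INTERCEPT = 0.65     # hypothetical: intercepts are not printed here

prediction = predict_unclamped(SLOPE_V2_BLEU4, ILLUSTRATIVE_INTERCEPT, TARGET_LENGTH)
extrapolated = is_extrapolated(TARGET_LENGTH, 5.0, 181.0)
```

Because nothing is clamped, a steep negative slope can push a prediction below zero at this length, which is why a negative BLEU-4 value can appear in the synthetic table.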

Figure: Synthetic prompt versions vs Zainaldin aggregate scores.
Figure: Synthetic minus Zainaldin aggregate score differences.

Synthetic Stephanos metrics at 220.5 words

Prompt Synthetic passage words Observed source word range Extrapolated? BLEU-4 chrF++ METEOR ROUGE-L BERTScore COMET BLEURT
legacy_scholarly v1 220.5 5.000-181.000 yes 11.0% 47.6% 47.0% 47.1% 88.2% 64.9% 26.9%
legacy_scholarly v2 220.5 5.000-181.000 yes -0.4% 39.4% 39.5% 46.8% 88.1% 64.8% 27.2%
legacy_scholarly v3 220.5 5.000-181.000 yes 6.5% 40.9% 46.5% 48.9% 87.7% 64.6% 29.5%

Zainaldin et al. reported metrics

Table 1 of Zainaldin et al. 2026 reports these values as mean scores multiplied by 100; the table below displays the same scale as percentages.

Text Model BLEU-4 chrF++ METEOR ROUGE-L BERTScore COMET BLEURT
Mix. ChatGPT 31.4% 53.4% 46.4% 50.9% 91.0% 79.9% 49.8%
Mix. Claude 34.2% 55.4% 48.5% 55.3% 91.6% 79.8% 50.4%
Mix. Gemini 34.2% 57.0% 50.0% 56.0% 91.5% 80.7% 51.3%
Mix. Aggregate 33.3% 55.3% 48.3% 54.1% 91.4% 80.1% 50.5%
Comp. ChatGPT 15.7% 47.4% 40.1% 45.7% 89.1% 75.1% 42.6%
Comp. Claude 16.7% 49.4% 42.9% 47.8% 89.7% 76.5% 46.2%
Comp. Gemini 19.0% 51.2% 44.4% 47.8% 89.9% 77.3% 45.8%
Comp. Aggregate 17.1% 49.3% 42.5% 47.1% 89.5% 76.3% 44.9%
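The synthetic-minus-Zainaldin differences can be recomputed from the two tables on this page. The sketch below uses the legacy_scholarly v1 synthetic row and both Aggregate rows; which aggregate row the difference figure actually uses is not stated here, so computing against both is an assumption, and the differences helper is hypothetical.

```python
METRICS = ("BLEU-4", "chrF++", "METEOR", "ROUGE-L", "BERTScore", "COMET", "BLEURT")

# Percent values copied from the tables on this page.
synthetic_v1 = (11.0, 47.6, 47.0, 47.1, 88.2, 64.9, 26.9)
mix_aggregate = (33.3, 55.3, 48.3, 54.1, 91.4, 80.1, 50.5)
comp_aggregate = (17.1, 49.3, 42.5, 47.1, 89.5, 76.3, 44.9)

def differences(synthetic, reference):
    """Synthetic minus reference, in percentage points, rounded to one decimal."""
    return {
        metric: round(s - r, 1)
        for metric, s, r in zip(METRICS, synthetic, reference)
    }

vs_mix = differences(synthetic_v1, mix_aggregate)
vs_comp = differences(synthetic_v1, comp_aggregate)
```

A negative difference means the synthetic Stephanos prediction falls below the reported aggregate for that metric.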

Prompt Version Summary

Prompt First translation Pairs Lemmas Slope Intercept R^2 Slope p |slope - 1| BLEU-4 chrF++ METEOR ROUGE-L BERTScore COMET BLEURT Trigram precision Trigram recall Trigram F1 Trigram Jaccard Mean abs residual
legacy_scholarly v1 2026-02-14 92 92 1.032 1.756 0.987 5.48e-87 0.032 33.7% 59.7% 65.4% 65.5% 93.4% 74.4% 67.2% 26.4% 28.3% 27.3% 15.8% 3.418
legacy_scholarly v2 2026-05-24 89 89 0.927 2.251 0.988 9.07e-86 0.073 56.0% 75.8% 80.6% 80.7% 96.7% 80.7% 77.3% 47.4% 46.2% 46.8% 30.5% 2.665
legacy_scholarly v3 2026-05-07 90 90 0.933 2.240 0.984 7.70e-81 0.067 55.1% 75.9% 80.5% 80.5% 96.7% 80.8% 77.9% 48.1% 47.2% 47.6% 31.3% 3.169
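The |slope - 1| and mean absolute residual columns summarize translation-length drift. The sketch below shows how they could be derived from a fitted line, assuming the length regression predicts AI translation word count from human translation word count; the page does not state the direction, so that is an assumption. The slope and intercept are the legacy_scholarly v1 values from the table above, while the length pairs and the length_diagnostics helper are hypothetical.

```python
def length_diagnostics(slope, intercept, pairs):
    """Summarize length drift for (human_words, ai_words) pairs.

    slope_gap is the distance of the slope from 1, where 1 would mean the AI
    length grows one-for-one with the human length. mean_abs_residual is the
    average absolute distance of each AI length from the fitted line.
    """
    residuals = [ai - (intercept + slope * human) for human, ai in pairs]
    return {
        "slope_gap": abs(slope - 1),
        "mean_abs_residual": sum(abs(r) for r in residuals) / len(residuals),
    }

# Slope and intercept from the legacy_scholarly v1 row; pairs are made up.
example_pairs = [(10, 12), (50, 55), (100, 104)]
diagnostics = length_diagnostics(1.032, 1.756, example_pairs)
```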

Downloadable Tables