Durham Priory Library OCR Project

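Thanks to a recent collaboration with Durham Cathedral's Priory Library, Rescribe has completed a project creating OCR models for four medieval manuscripts, with output at an accuracy rate of over 90%. The manuscripts used were Durham Cathedral Library MS B.II.16 (Augustine, In Johannis Evangelium), MS B.II.18 (Augustine, Sermons on the Gospels), MS B.III.1 (Origen, Old Testament homilies) and MS A.II.4 (Bible of William of St Calais).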

We plan to release the models we created, as well as the supporting software and data, very soon, on LatinOCR.org.

The Priory Library supplied Rescribe with high-resolution images of the chosen manuscripts, and Rescribe used the images to train OCR models and produce machine-readable output for each manuscript.

The project had three core objectives: first, to use exclusively open source tools to train an OCR model for the manuscripts at hand; second, to make the model-building process as time-efficient as possible by finding the minimum amount of ground truth necessary for an effective model; and third, to work out a process for expanding abbreviations and glyphs in medieval manuscripts as accurately as possible.

The program of choice for the project was OCRopus, an open source program that adapts well to a variety of purposes. Its individual steps (binarization, segmentation, training, recognition) are separate and can be altered independently. For example, since we used other open source image programs, ScanTailor and GIMP, for pre-processing, we left out OCRopus' binarization step. Since the original OCRopus did not contain all the Unicode characters used for medieval manuscripts, we adjusted both the training and the recognition scripts to include them. The complete character set used was the following:

[ ~ ! & ' ( ) * , . 0 1 2 3 4 5 6 7 8 9 : ; < = > ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ a b c d e f g h i j k l m n o p q r s t u v w x y z { } § « ¶ · » Ã Æ Õ ß à â ã æ è ê ì î ò ô õ ÷ ù û Ā ā đ Ē ę ē Ĩ ĩ Ī ī Ō ō Œ œ Ũ ũ Ū ū ǣ ǽ Ẽ ẽ ‚ † ‡ ‹ › ※ ⸗ ꝑ ꝓ ꝗ ꝙ ꝛ ꝝ ꝫ ꝯ ꝶ ] ꝗͣ r̄ t̄ pͣ tͣ qͥ qͦ gͦ qͣ mͦ uͦ Vͥ pͥ nͥ gͥ n̄ p̄ mͥ qͣ p̄ c̄ tͥ m̄ q̄ x̄ s̄ n̄ ł ƀ ľ ę ħ ƀ ľ ł ħ ƀ
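Several entries in this set are a base letter followed by a combining mark (for example q followed by U+0365), so they occupy more than one code point. The following is a minimal sketch, assuming the ground truth is plain Unicode text, of how such entries could be grouped and how a transcription line could be checked against a character set. The function names and the small `CHARSET` sample are hypothetical and are not part of OCRopus or of the project's scripts.

```python
import unicodedata


def split_glyphs(text):
    """Group each base character with the combining marks that follow it."""
    glyphs = []
    for ch in unicodedata.normalize("NFC", text):
        # combining() returns a non-zero class for combining marks
        if glyphs and unicodedata.combining(ch):
            glyphs[-1] += ch
        else:
            glyphs.append(ch)
    return glyphs


# Tiny illustrative subset of the character set (hypothetical selection):
# plain letters, U+A751 and U+A753, q + U+0365, p + U+0304.
CHARSET = set(split_glyphs("abcdeilqtu\ua751\ua753q\u0365p\u0304"))


def unknown_glyphs(line, charset=CHARSET):
    """Return the glyphs in a line that the character set does not cover."""
    return [g for g in split_glyphs(line) if g != " " and g not in charset]


print(split_glyphs("q\u0365d"))      # q with its combining mark, then d
print(unknown_glyphs("x\u0304 et"))  # x + macron is not in the sample set
```

Normalizing to NFC first means that a letter plus macron is compared in the same form whether it was typed as one precomposed character or as two code points.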

The initial output of the OCR recognition process preserved these characters. We then processed the documents further in two steps. First, "non-standard" formations (e.g. xpm, ihs, eccla, gra) were expanded; second, the "standard" expansions were added. The distinction matters because an ambiguous abbreviation like ā can be expanded either to "am" as a standard abbreviation, or to "an" or something entirely different, depending on context, as a non-standard abbreviation.
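The two-step order can be sketched as follows. This is only an illustration under stated assumptions: the two lookup tables, their entries and the function names are hypothetical, chosen to match a few forms visible in the samples, and are not the project's actual rules. Whole-word forms are looked up first, and glyph-level replacements are applied only to words that did not match.

```python
import unicodedata

# Hypothetical whole-word ("non-standard") forms, checked first.
NON_STANDARD = {
    "xp\u0304i": "Christi",
    "dn\u0304s": "dominus",
    "dn\u0304m": "dominum",
}

# Hypothetical glyph-level ("standard") expansions, applied second.
STANDARD = {
    "\ua751": "per",     # p with stroke through descender
    "\ua753": "pro",     # p with flourish
    "p\u0304": "prae",   # p + combining macron
    "n\u0304": "non",    # n + combining macron
    "\u016b": "um",      # u with macron
    "\u0113": "em",      # e with macron
}


def expand_word(word):
    """Expand one token, keeping trailing punctuation in place."""
    core = word.rstrip(".,;:?")
    tail = word[len(core):]
    if core in NON_STANDARD:          # step 1: whole-word lookup
        return NON_STANDARD[core] + tail
    for glyph, expansion in STANDARD.items():   # step 2: glyph-level rules
        core = core.replace(glyph, expansion)
    return core + tail


def expand_line(line):
    """Normalize to NFC so table keys and input use the same code points."""
    line = unicodedata.normalize("NFC", line)
    return " ".join(expand_word(w) for w in line.split(" "))


print(expand_line("placer\u0113 xp\u0304i seruus"))  # placerem Christi seruus
print(expand_line("ad dn\u0304m."))                  # ad dominum.
```

In this sketch the order matters: if the glyph-level rule for n with macron were applied to "dn̄m" before the whole-word lookup, the result would be "dnonm" rather than "dominum".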

For the three manuscripts displayed, the following models were trained:

Combining the ground truth for MS B.II.18 and MS B.III.1 gave better results with fewer pages of ground truth. Adding MS A.II.4 to that combination, however, did not yield a functional model for MS A.II.4, so a separate model was built for that manuscript.

The character error rate of the results ranged from about 0% to 10%, meaning that the resulting pages had a character accuracy of roughly 90% or more.
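As a worked illustration of the metric, here is a minimal sketch of a character error rate calculation. It assumes the common definition of edit (Levenshtein) distance between ground truth and OCR output divided by the length of the ground truth; the source does not state exactly how its figures were computed, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def character_error_rate(truth, ocr):
    """Edit distance divided by the length of the ground truth."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / len(truth)


# One wrong character in a five-character word gives a rate of 0.2.
print(character_error_rate("leuet", "leuat"))  # 0.2
```

Because the distance is counted in code points, a letter with a separate combining mark counts as two characters under this sketch.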

OCR Samples

Durham Cathedral Library MS. B.II.16 - Augustine, In Johannis Evangelium, page f.4r

OCR Results

hoc dicetis qꝛ ego uoƀ sū p̄sentior quā d~ A bsit. Multo ē ille p̄sentior. Nā ego ocul ur̄is appa
re ille c̄scientiis ur̄is p̄sidet. Ad me aures ad illū cor. ut utrunq; impleatis. Ecce oculos uros
et sensus istos corporis. leuatis ad nos. Nec ad nos. Non enī nos de illis montib:. Sed ad ipsū euan
geliū ad ipsū euangelistā. cor aut̄ implendū ad dn̄m. Et ui qͥsq; sic leuet . ut uideat qͥd
leuet. et qͥ leuet. qͥd dixi. qͥd leuet. et a qͦ leuet ? Quale cor leuat uideat. qꝛ ad dn̄m leuat. ne sar
cina uoluptatis carnalis p̄ grauatū ante cadat quā fuerit subleuatū. Sed uidet se gestare
onus carnis. det oꝑram ꝑ tinentiā ut purget qͦd leuet ad dn̄ m. Seati menī mundo cor d e. qm

OCR Results (expanded abbreviations)

hoc dicetis quia ego uobis sum praesentior quam d~ A bsit. Multo est ille praesentior. Nam ego ocul uestris appa
re ille conscientiis uestris praesidet. Ad me aures ad illum cor. ut utrunque impleatis. Ecce oculos uros
et sensus istos corporis. leuatis ad nos. Nec ad nos. Non enim nos de illis montibus. Sed ad ipsum euan
gelium ad ipsum euangelistam. cor auter implendum ad dominum. Et ui quisque sic leuet . ut uideat quid
leuet. et qui leuet. quid dixi. quid leuet. et a quo leuet ? Quale cor leuat uideat. qet ad dominum leuat. ne sar
cina uoluptatis carnalis prae grauatum ante cadat quam fuerit subleuatum. Sed uidet se gestare
onus carnis. det operram per tinentiam ut purget quod leuet ad dnon m. Seati menim mundo cor d e. qm

Durham Cathedral Library MS. B.II.18 - Augustine, Sermons on the Gospels, page f.5v

OCR Results

placerē xp̄i seruus n̄ ēc̄m. assit ipse do
mins qͥ etiā in suo seruo et aplo lo que
batͣ. et aꝑiat nobis uoluntatē suā. et
tribuat obedicndi facultatē . Ipsa qͥppe
uerba euangłica secũ portant etpo
sitiones suas . nec int̄ cludt̄ ora esuri
entiũ . qͥa pascunt corda pulsantium
didit e głificent patrē urm qͥ in cęlii
est. s. ꝑots addidit . alioquin mer
cedē n̄ habebitis apđ patrē ur̄m qͥ in
cęlis est. ninc enī ostendit eos qͥ ta
les st̄ qͣles fideles suos ēē n̄ uult. ineo
ipso mercedē quęrere g uident al
hominibꝫ . ibi c̄stituerę bonũ suum

OCR Results (expanded abbreviations)

placerem Christi seruus non essem. assit ipse do
mins qui etiam in suo seruo et aplo lo que
batra. et aperiat nobis uoluntatem suam. et
tribuat obedicndi facultatem . Ipsa quippe
uerba euangelica secum portant etpo
sitiones suas . nec inter cludit ora esuri
entium . quia pascunt corda pulsantium
didit e glorificent patrem urm qui in caelii
est. s. perots addidit . alioquin mer
cedem non habebitis apud patrem uestrum qui in
caelis est. ninc enim ostendit eos qui ta
les ster quales fideles suos esse non uult. ineo
ipso mercedem quaerere g uident al
hominibus . ibi constituere bonum suum

Durham Cathedral Library MS. B.III.1 - Origen, Old Testament homilies, page f.3v

OCR Results

ꝑallegoriā debeant exposuimus cū diximus iuberi aꝗͣ
id est mentē eius sensū ꝓducere. et t̄rā sensū carnis
ꝓferre. ut dn̄etur eis mens. et n̄ illa dn̄entr ei. Uult
enī d~ ut magna ista dī factura hominis ꝑpt̄ quē
et uniuers creats ē mund. n̄ solū īmaculatus sit ab
his quę supͣ diximus et immunis. s et dn̄etur eis. Iā uͦ qͣle
animal sit homo. ex ipsis scripturę sermonib; c̄si
demus. Om̄s reliquę creaturę iussione dī fiunt di
cente scriptura. t di~ d~ fiat firmamtū.t di~ d~
qꝛ cęlū mͥ sedes est. t̄ra aut̄ scabellū pedū meoꝝꝫ su
spicantur dm̄ tā ingentis eē corporis.ut putet eum
sedentē incęlo pedes usq; adt̄rā ꝓtende. Hoc autē
sentiunt. qꝛ n̄ habent illas aures quę dignepossint
audire uerba dī de dō. ~ ꝑscripturā referuntur đ
enī dicit cęlū mͥ sedes ē. ita digne de eo intelligitu
ut sciamus qͥa inhis quoꝝ incęlis ē. c̄uersatio d~ reqͥ
escit et residet. In his autē qͥ a huc t̄renū ꝓpositũ
gerunt . ultima pars eiusꝓuidentię inuenitu. qđ in

OCR Results (expanded abbreviations)

perallegoriam debeant exposuimus cum diximus iuberi aquae
id est mentem eius sensum producere. et terram sensum carnis
proferre. ut dnonetur eis mens. et non illa dnonentr ei. Uult
enim d~ ut magna ista dim factura hominis propter quem
et uniuers creats em mund. non solum immaculatus sit ab
his quae supra diximus et immunis. s et dnonetur eis. Iam uero quale
animal sit homo. ex ipsis scripturae sermonibus consi
demus. Omnes reliquae creaturae iussione dei fiunt di
cente scriptura. t di~ d~ fiat firmamtum.t di~ d~
quia caelum mihi sedes est. terra auter scabellum pedum meorumus su
spicantur dmen tam ingentis esse corporis.ut putet eum
sedentem incaelo pedes usque adterram protende. Hoc autem
sentiunt. qet non habent illas aures quae dignepossint
audire uerba dim de deo. ~ perscripturam referuntur de
enim dicit cęlum mihi sedes em. ita digne de eo intelligitu
ut sciamus quia inhis quorum incęlis em. conuersatio d~ requi
escit et residet. In his autem qui a huc terrenum propositum
gerunt . ultima pars eiusprouidentiae inuenitu. qde in

Durham Cathedral Library MS. A.II.4 - Bible of William of St Calais, page f.14r

OCR Results

in bethel non adicies ultra ut ꝓphetes quia
scificatio regis. ē et domus regni ē. et respon
dit amos: et dixit ad amasu. Non sũ ꝓpha et
n sũ fili ꝓpƀę: s armentaris egto sū uellicans
sicomoros eu tulit me dn̄s eũ sequerer gre
gem et dixit ad me. Uade. ꝓpba ad popłm
meũ isrł et nunc audi īibũ ni. Tu dicis non
ꝓphabi suꝑ isrł et nion stullabis suꝑ domũ
idoli pt̄ hoc haec dicit dn̄s aor tua in
ciuitate fornicabitur et filii tui et filię tuę
eos. E si absconditi fuerit in uertice carmeli;
inde scrutans aufera eos. E si celerauerint se
ab ocłis meis in fundo maris. ibi man dibo
serpenti. et mordebit eos & si abierint in ca
ptiitatē cora inimicis suis; ibi manchabo
gładio. et occidet eo: et ponam oculosmeos
nꝑ eros inmalũ et n̄ inbonũ. dt dn̄s đa
tuiis exercituũ qiii tangit t̄rā et tabescet.
et lugebunt om̄s habitantes in ea et ascendax;
sicut ruis onīs. et defluet sicut fluuius

OCR Results (expanded abbreviations)

in bethel non adicies ultra ut prophetes quia
scificatio regis. est et domus regni est. et respon
dit amos: et dixit ad amasu. Non sum propheta et
n sum fili propberae: s armentaris egto sum uellicans
sicomoros eu tulit me dominus eum sequerer gre
gem et dixit ad me. Uade. propba ad popłm
meum israel et nunc audi imibum ni. Tu dicis non
prophabi super israel et nion stullabis super domum
idoli pter hoc haec dicit dominus aor tua in
ciuitate fornicabitur et filii tui et filiae tuae
eos. Et si absconditi fuerit in uertice carmeli;
inde scrutans aufera eos. Et si celerauerint se
ab oculis meis in fundo maris. ibi man dibo
serpenti. et mordebit eos & si abierint in ca
ptiitatem cora inimicis suis; ibi manchabo
gładio. et occidet eo: et ponam oculosmeos
nper eros inmalum et non inbonum. dt dominus dea
tuiis exercituum qiii tangit terram et tabescet.
et lugebunt omnis habitantes in ea et ascendax;
sicut ruis onims. et defluet sicut fluuius