Durham Priory Library OCR Project
Thanks to a recent collaboration with Durham Cathedral's Priory Library, Rescribe has completed a project to create OCR models for four medieval manuscripts, producing output with an accuracy rate of over 90%. The manuscripts used were:
- MS A.II.4: Bible of William of St Calais
- MS B.II.16: Augustine, In Johannis Evangelium
- MS B.II.18: Augustine, Sermones de verbis Domini et Apostoli
- MS B.III.1: Origen, Homilies on Old Testament
We plan to release the models we created, as well as the supporting software and data, very soon, on LatinOCR.org.
The Priory Library supplied Rescribe with high-resolution images of the chosen manuscripts, and Rescribe used the images to train OCR models and to produce machine-readable output for each manuscript.
The project had three core objectives: firstly, we wanted to use exclusively open source tools to train an OCR model for the manuscripts at hand; secondly, we wanted the model-building process to be as time-efficient as possible, which meant finding the minimum amount of ground truth necessary for an effective model; and thirdly, we wanted to work out a process for expanding abbreviations and glyphs in medieval manuscripts as accurately as possible.
The program of choice for the project was OCRopus, an open source program that is highly adaptable to a variety of purposes. Its individual steps (binarization, segmentation, training, recognition) are separate and can be altered independently. For example, since we used other open source image-processing programs, ScanTailor and GIMP, for pre-processing, we left out OCRopus' binarization step. Since the original OCRopus did not include all the Unicode characters used for medieval manuscripts, we adjusted both the training and the recognition scripts to include them. The complete character set used was the following:
[ ~ ! & ' ( ) * , . 0 1 2 3 4 5 6 7 8 9 : ; < = > ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ a b c d e f g h i j k l m n o p q r s t u v w x y z { } § « ¶ · » Ã Æ Õ ß à â ã æ è ê ì î ò ô õ ÷ ù û Ā ā đ Ē ę ē Ĩ ĩ Ī ī Ō ō Œ œ Ũ ũ Ū ū ǣ ǽ Ẽ ẽ ‚ † ‡ ‹ › ※ ⸗ ꝑ ꝓ ꝗ ꝙ ꝛ ꝝ ꝫ ꝯ ꝶ ] ꝗͣ r̄ t̄ pͣ tͣ qͥ qͦ gͦ qͣ mͦ uͦ Vͥ pͥ nͥ gͥ n̄ p̄ mͥ qͣ p̄ c̄ tͥ m̄ q̄ x̄ s̄ n̄ ł ƀ ľ ę ħ ƀ ľ ł ħ ƀ
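Several entries in this set are a base letter followed by a combining mark (for example q with a superscript i, or n with a macron), so a character set of this kind cannot simply be collected code point by code point. The following is a minimal Python sketch of one way to gather such a set from ground-truth lines. It is an illustration only, not the code used in OCRopus or in the project, and the function names `clusters` and `charset` are hypothetical.

```python
import unicodedata


def clusters(text):
    """Split text into units of one base character plus any combining marks."""
    units = []
    for ch in text:
        # unicodedata.combining() is non-zero for combining marks such as
        # U+0304 (macron) or U+0365 (superscript i).
        if units and unicodedata.combining(ch):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def charset(lines):
    """Collect the sorted set of non-whitespace units used in the given lines."""
    seen = set()
    for line in lines:
        # NFC makes precomposed and decomposed spellings compare equal.
        normalized = unicodedata.normalize("NFC", line)
        seen.update(u for u in clusters(normalized) if not u.isspace())
    return sorted(seen)


if __name__ == "__main__":
    # Two short fragments in the style of the ground truth (illustrative input).
    print(charset(["dn\u0304m", "q\u0365d"]))
```

Keeping the base letter and its mark together means the recognizer's output alphabet can contain entries such as "q with superscript i" as single symbols, which is how the list above presents them.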
The initial output of the OCR recognition process preserved these characters. We then processed the documents further in two steps: first, we expanded "non-standard" formations, e.g. xpm, ihs, eccla, gra; in a second step, the "standard" expansions were added. The distinction matters because an ambiguous abbreviation like ā could be expanded to "am" as a standard abbreviation, or to "an" or something entirely different, depending on context, as a non-standard abbreviation.
For the four manuscripts displayed, the following three models were trained (with the amount of ground truth used for each):
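The two-step expansion can be sketched in Python as a whole-word lookup followed by a glyph-by-glyph substitution. This is a minimal sketch under stated assumptions, not the project's actual software: the names `WORD_EXPANSIONS`, `GLYPH_EXPANSIONS`, `expand_token` and `expand` are hypothetical, and the small tables contain only pairs that can be read off the OCR samples on this page.

```python
import re
import unicodedata


def nfc(s):
    """Normalize so precomposed and decomposed spellings compare equal."""
    return unicodedata.normalize("NFC", s)


# Step 1: "non-standard" whole-word forms (illustrative entries only).
WORD_EXPANSIONS = {nfc(k): v for k, v in {
    "dn\u0304m": "dominum",
    "ur\u0304is": "uestris",
    "e\u0304e\u0304": "esse",
}.items()}

# Step 2: "standard" glyph-level expansions (illustrative entries only).
GLYPH_EXPANSIONS = {nfc(k): v for k, v in {
    "u\u0304": "um", "i\u0304": "im", "a\u0304": "am", "e\u0304": "em",
    "c\u0304": "con", "n\u0304": "non",
    "q\u0365": "qui", "q\u0366": "quo",
    "\ua751": "per", "\ua753": "pro",
}.items()}

# Try longer glyph sequences before shorter ones.
GLYPHS_LONGEST_FIRST = sorted(GLYPH_EXPANSIONS, key=len, reverse=True)


def expand_token(token):
    # Separate trailing punctuation so "dn̄m." still matches the word table.
    core, tail = re.fullmatch(r"(.*?)([.,;:?!]*)", token, flags=re.S).groups()
    if core in WORD_EXPANSIONS:
        return WORD_EXPANSIONS[core] + tail
    for glyph in GLYPHS_LONGEST_FIRST:
        core = core.replace(glyph, GLYPH_EXPANSIONS[glyph])
    return core + tail


def expand(text):
    return " ".join(expand_token(t) for t in nfc(text).split(" "))


if __name__ == "__main__":
    print(expand("e\u0304e\u0304 n\u0304 uult"))
    print(expand("ad dn\u0304m."))
```

Running the whole-word pass first is what keeps a form like ēē from being expanded glyph by glyph into "emem". A word that is missing from the first table falls through to the second pass, which may then produce a wrong reading.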
- MS A.II.4: 4 pages (304 lines)
- MS B.II.16: 4 pages (139 lines)
- MS B.II.18: 2 pages (157 lines) + MS B.III.1: 2 pages (150 lines) = 4 pages (307 lines)
By combining the ground truth for MS B.II.18 and MS B.III.1, better results could be achieved with fewer pages of ground truth per manuscript; combining MS B.II.18 and MS B.III.1 with MS A.II.4, however, did not yield a functional model for MS A.II.4, so a separate model was built for it.
The results fell within a character error rate of ca. 0-10%, meaning the resulting pages had a character accuracy of roughly 90% or more.
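For readers unfamiliar with the measure, character error rate is commonly computed as the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The Python sketch below shows that common definition; it is an assumption that the project's figures were computed in this way, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference (ground truth)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One wrong character in a five-character word.
    rate = character_error_rate("leuet", "leuat")
    print(f"CER: {rate:.0%}, character accuracy: {1 - rate:.0%}")
```

Under this definition a character error rate of 10% corresponds to a character accuracy of 90%. Accuracy counted per word is normally lower, since a single wrong character makes the whole word count as wrong.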
OCR Samples
Durham Cathedral Library MS. B.II.16 - Augustine, In Johannis Evangelium, page f.4r
OCR Results
re ille c̄scientiis ur̄is p̄sidet. Ad me aures ad illū cor. ut utrunq; impleatis. Ecce oculos uros
et sensus istos corporis. leuatis ad nos. Nec ad nos. Non enī nos de illis montib:. Sed ad ipsū euan
geliū ad ipsū euangelistā. cor aut̄ implendū ad dn̄m. Et ui qͥsq; sic leuet . ut uideat qͥd
leuet. et qͥ leuet. qͥd dixi. qͥd leuet. et a qͦ leuet ? Quale cor leuat uideat. qꝛ ad dn̄m leuat. ne sar
cina uoluptatis carnalis p̄ grauatū ante cadat quā fuerit subleuatū. Sed uidet se gestare
onus carnis. det oꝑram ꝑ tinentiā ut purget qͦd leuet ad dn̄ m. Seati menī mundo cor d e. qm
OCR Results (expanded abbreviations)
re ille conscientiis uestris praesidet. Ad me aures ad illum cor. ut utrunque impleatis. Ecce oculos uros
et sensus istos corporis. leuatis ad nos. Nec ad nos. Non enim nos de illis montibus. Sed ad ipsum euan
gelium ad ipsum euangelistam. cor auter implendum ad dominum. Et ui quisque sic leuet . ut uideat quid
leuet. et qui leuet. quid dixi. quid leuet. et a quo leuet ? Quale cor leuat uideat. qet ad dominum leuat. ne sar
cina uoluptatis carnalis prae grauatum ante cadat quam fuerit subleuatum. Sed uidet se gestare
onus carnis. det operram per tinentiam ut purget quod leuet ad dnon m. Seati menim mundo cor d e. qm
Durham Cathedral Library MS. B.II.18 - Augustine, Sermons on the Gospels, page f.5v
OCR Results
mins qͥ etiā in suo seruo et aplo lo que
batͣ. et aꝑiat nobis uoluntatē suā. et
tribuat obedicndi facultatē . Ipsa qͥppe
uerba euangłica secũ portant etpo
sitiones suas . nec int̄ cludt̄ ora esuri
entiũ . qͥa pascunt corda pulsantium
est. s. ꝑots addidit . alioquin mer
cedē n̄ habebitis apđ patrē ur̄m qͥ in
cęlis est. ninc enī ostendit eos qͥ ta
les st̄ qͣles fideles suos ēē n̄ uult. ineo
ipso mercedē quęrere g uident al
hominibꝫ . ibi c̄stituerę bonũ suum
OCR Results (expanded abbreviations)
mins qui etiam in suo seruo et aplo lo que
batra. et aperiat nobis uoluntatem suam. et
tribuat obedicndi facultatem . Ipsa quippe
uerba euangelica secum portant etpo
sitiones suas . nec inter cludit ora esuri
entium . quia pascunt corda pulsantium
est. s. perots addidit . alioquin mer
cedem non habebitis apud patrem uestrum qui in
caelis est. ninc enim ostendit eos qui ta
les ster quales fideles suos esse non uult. ineo
ipso mercedem quaerere g uident al
hominibus . ibi constituere bonum suum
Durham Cathedral Library MS. B.III.1 - Origen, Old Testament homilies, page f.3v
OCR Results
id est mentē eius sensū ꝓducere. et t̄rā sensū carnis
ꝓferre. ut dn̄etur eis mens. et n̄ illa dn̄entr ei. Uult
enī d~ ut magna ista dī factura hominis ꝑpt̄ quē
et uniuers creats ē mund. n̄ solū īmaculatus sit ab
his quę supͣ diximus et immunis. s et dn̄etur eis. Iā uͦ qͣle
animal sit homo. ex ipsis scripturę sermonib; c̄si
demus. Om̄s reliquę creaturę iussione dī fiunt di
cente scriptura. t di~ d~ fiat firmamtū.t di~ d~
spicantur dm̄ tā ingentis eē corporis.ut putet eum
sedentē incęlo pedes usq; adt̄rā ꝓtende. Hoc autē
sentiunt. qꝛ n̄ habent illas aures quę dignepossint
audire uerba dī de dō. ~ ꝑscripturā referuntur đ
enī dicit cęlū mͥ sedes ē. ita digne de eo intelligitu
ut sciamus qͥa inhis quoꝝ incęlis ē. c̄uersatio d~ reqͥ
escit et residet. In his autē qͥ a huc t̄renū ꝓpositũ
gerunt . ultima pars eiusꝓuidentię inuenitu. qđ in
OCR Results (expanded abbreviations)
id est mentem eius sensum producere. et terram sensum carnis
proferre. ut dnonetur eis mens. et non illa dnonentr ei. Uult
enim d~ ut magna ista dim factura hominis propter quem
et uniuers creats em mund. non solum immaculatus sit ab
his quae supra diximus et immunis. s et dnonetur eis. Iam uero quale
animal sit homo. ex ipsis scripturae sermonibus consi
demus. Omnes reliquae creaturae iussione dei fiunt di
cente scriptura. t di~ d~ fiat firmamtum.t di~ d~
spicantur dmen tam ingentis esse corporis.ut putet eum
sedentem incaelo pedes usque adterram protende. Hoc autem
sentiunt. qet non habent illas aures quae dignepossint
audire uerba dim de deo. ~ perscripturam referuntur de
enim dicit cęlum mihi sedes em. ita digne de eo intelligitu
ut sciamus quia inhis quorum incęlis em. conuersatio d~ requi
escit et residet. In his autem qui a huc terrenum propositum
gerunt . ultima pars eiusprouidentiae inuenitu. qde in
Durham Cathedral Library MS. A.II.4 - Bible of William of St Calais, page f.14r
OCR Results
scificatio regis. ē et domus regni ē. et respon
dit amos: et dixit ad amasu. Non sũ ꝓpha et
n sũ fili ꝓpƀę: s armentaris egto sū uellicans
sicomoros eu tulit me dn̄s eũ sequerer gre
gem et dixit ad me. Uade. ꝓpba ad popłm
meũ isrł et nunc audi īibũ ni. Tu dicis non
ꝓphabi suꝑ isrł et nion stullabis suꝑ domũ
idoli pt̄ hoc haec dicit dn̄s aor tua in
ciuitate fornicabitur et filii tui et filię tuę
inde scrutans aufera eos. E si celerauerint se
ab ocłis meis in fundo maris. ibi man dibo
serpenti. et mordebit eos & si abierint in ca
ptiitatē cora inimicis suis; ibi manchabo
gładio. et occidet eo: et ponam oculosmeos
nꝑ eros inmalũ et n̄ inbonũ. dt dn̄s đa
tuiis exercituũ qiii tangit t̄rā et tabescet.
et lugebunt om̄s habitantes in ea et ascendax;
sicut ruis onīs. et defluet sicut fluuius
OCR Results (expanded abbreviations)
scificatio regis. est et domus regni est. et respon
dit amos: et dixit ad amasu. Non sum propheta et
n sum fili propberae: s armentaris egto sum uellicans
sicomoros eu tulit me dominus eum sequerer gre
gem et dixit ad me. Uade. propba ad popłm
meum israel et nunc audi imibum ni. Tu dicis non
prophabi super israel et nion stullabis super domum
idoli pter hoc haec dicit dominus aor tua in
ciuitate fornicabitur et filii tui et filiae tuae
inde scrutans aufera eos. Et si celerauerint se
ab oculis meis in fundo maris. ibi man dibo
serpenti. et mordebit eos & si abierint in ca
ptiitatem cora inimicis suis; ibi manchabo
gładio. et occidet eo: et ponam oculosmeos
nper eros inmalum et non inbonum. dt dominus dea
tuiis exercituum qiii tangit terram et tabescet.
et lugebunt omnis habitantes in ea et ascendax;
sicut ruis onims. et defluet sicut fluuius