Rescribe Ltd offers custom-tailored Optical Character Recognition (OCR) Services for historical texts. Our OCR training packages are designed for the Tesseract and OCRopus engines and can be downloaded and used for free from latinocr.org and github.com. For more information on our commercial service, read on or contact firstname.lastname@example.org.
Our service comprises three steps: preprocessing the images or PDFs, running OCR on them, and postprocessing the resulting output. Upon request, the output can be proofread for maximum accuracy.
Preprocessing the scans or image files serves to deskew pages, reduce noise in the scans and convert the result into a binarized image which can be processed by Tesseract, our OCR engine.
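As an illustration of the binarization part of this step, here is a minimal sketch in Python. It uses Otsu's method, a widely used global thresholding technique chosen here purely as an example; the text above does not say which binarization method Rescribe applies. The function names `otsu_threshold` and `binarize` are hypothetical, and the image is assumed to be a list of rows of grey levels from 0 to 255.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance.

    Otsu's method is an illustrative assumption, not the documented
    Rescribe pipeline.
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the dark class
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarize(image):
    """Map every pixel to 0 (ink) or 255 (paper) using the Otsu threshold."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[0 if value <= threshold else 255 for value in row] for row in image]


# A tiny made-up "scan": dark ink pixels on a light page.
page = [[20, 30, 200],
        [210, 25, 220]]
print(binarize(page))
```

A global threshold like this tends to work on evenly lit scans; unevenly lit or stained pages usually need a locally adaptive method instead.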
In the second step we run the OCR on the preprocessed files, using our specially trained packages and adapting language and character settings to the document at hand.
The resulting output is further refined in an automated postprocessing step: special characters and ligatures are optionally expanded and adapted to modern typographic standards (e.g. æ to ae). These settings can be adjusted to the customer’s preferences. The final output can be delivered as raw text files, as searchable PDFs or in hOCR format.
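The optional expansion of special characters can be sketched in a few lines of Python. This is only an illustration: apart from æ to ae, which is the example given above, the mapping below is an assumption, and the names `EXPANSIONS` and `expand_ligatures` are hypothetical rather than part of any Rescribe tool.

```python
# Hypothetical default mapping. Only "æ" -> "ae" comes from the text above;
# the other entries are assumed examples of historical characters.
EXPANSIONS = {
    "æ": "ae",
    "œ": "oe",
    "ſ": "s",    # long s
    "ﬁ": "fi",   # fi ligature
    "ﬂ": "fl",   # fl ligature
}


def expand_ligatures(text, expansions=EXPANSIONS):
    """Replace each character listed in `expansions` with its modern spelling."""
    return text.translate(str.maketrans(expansions))


print(expand_ligatures("theses theologicæ ac scholasticæ"))

# Passing a smaller mapping models a customer preference, e.g. modernising
# the long s while keeping the æ ligature as printed.
print(expand_ligatures("ſcholasticæ", {"ſ": "s"}))
```

Keeping the mapping as plain data means a preference such as "keep æ, expand ſ" only changes the dictionary, not the code.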
We naturally aim to deliver output of optimum accuracy; however, the quality can depend on a number of factors beyond our control (such as the quality of the page and scan, the fonts, and otherworldly characters). A set of analytical tools allows us to quickly assess the relative quality of the output and devote more attention to problem areas. Where higher accuracy is required, the output can optionally be proofread to ensure that requirements are met.
We charge an hourly rate for our services, which allows different steps of the process to be combined to suit both needs and budget. Whilst preprocessing and OCR are always needed, manual proofreading is optional and charged at a lower rate. For an estimate, contact us at email@example.com with a page sample, a description of the document to be processed and your requirements for the output.
- Durham Priory Library, Various manuscripts (2017)
- Middle Temple Library, Gabriel Powel's De adiaphoris theses theologicæ ac scholasticæ (2016)
Rescribe Ltd is a not-for-profit spin-out company based on research carried out at Durham University. The formation of the company and the initial development of the software have been funded by a Proof-of-Concept grant awarded by the European Research Council. The grant builds on research developed as part of the project Living Poets: A New Approach to Ancient Poetry, directed by Prof. Barbara Graziosi and also funded by the European Research Council.