Rescribe

A desktop tool for OCR, using Tesseract and modern, efficient preprocessing and analysis pipelines. It is optimised for historical printed works, and includes models for a variety of historical scripts.

Download

rescribe 0.4.0 for Windows (2021-07-09)

rescribe 0.4.0 for Mac OS X (2021-07-09)

rescribe 0.4.0 for Linux (2021-07-09)

Click the link for your operating system to download the tool, and save it to the folder where you want to do your OCR.

Usage

Rescribe is a command-line tool, but don’t worry, it isn’t hard to use!

Start by opening a terminal window. On Windows, you can type cmd.exe into the Run box; on Mac OS X, it’s under Applications → Utilities → Terminal; and if you’re on Linux, I bet you already know where to find your terminal.

First, navigate to the folder where you saved the tool by running the cd command, for example cd Documents/OCR.

If you’re on Linux or Mac OS X, you will probably need to make the program executable after downloading it, so do that now by running chmod +x rescribe. You’ll only have to do that once.
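If you’d like to script the setup steps for Linux or Mac OS X, here is a minimal sketch. The function name prepare_rescribe is made up for this example and is not part of rescribe; it assumes the downloaded file is saved as rescribe in the folder you pass in.

```shell
# Hypothetical helper, not part of rescribe itself.
# Usage: prepare_rescribe <folder containing the downloaded rescribe file>
prepare_rescribe() {
    dir="$1"
    # Assumption: the download was saved under the name "rescribe".
    if [ ! -f "$dir/rescribe" ]; then
        echo "no rescribe file found in $dir" >&2
        return 1
    fi
    # Only change permissions if the file is not already executable.
    if [ ! -x "$dir/rescribe" ]; then
        chmod +x "$dir/rescribe"
    fi
    echo "rescribe is ready in $dir"
}
```

You could then run prepare_rescribe Documents/OCR followed by cd Documents/OCR and carry on as described here.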

You use rescribe by giving it the name of the directory containing the book or manuscript pages you want to OCR. Basic usage looks like this:

./rescribe mybook


This will run rescribe over all pages in the directory mybook. A successful run will add several new files to mybook.

Rescribe has a set of OCR models built in, and it defaults to one trained specifically for historic printed Latin books. To see the other models available, run ./rescribe by itself, and it will print the list. You can then choose an alternative model with the -t flag. For example, to use a model trained for Caroline Minuscule manuscripts, you would run:

./rescribe -t carolinemsv1_fast.traineddata mybook

If you have another model you would like to use, you can just put it in the same folder as rescribe and use its file name after -t.
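If you have several books to process with the same model, a small loop saves some typing. This is only a sketch: the function name ocr_books and the RESCRIBE variable (for pointing at a rescribe program somewhere other than ./rescribe) are assumptions made for this example, not features of rescribe. The only rescribe usage it relies on is the -t form shown above.

```shell
# Hypothetical helper, not part of rescribe itself.
# Usage: ocr_books <model file name> <book directory>...
ocr_books() {
    model="$1"
    shift
    failed=0
    for book in "$@"; do
        if [ ! -d "$book" ]; then
            echo "skipping $book: not a directory" >&2
            failed=1
            continue
        fi
        # Same invocation as the -t example, once per book directory.
        # RESCRIBE is an assumed override; the default is ./rescribe.
        if ! "${RESCRIBE:-./rescribe}" -t "$model" "$book"; then
            echo "rescribe failed on $book" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

For example, ocr_books carolinemsv1_fast.traineddata book1 book2 would run rescribe on book1 and then book2, and return a non-zero status if either run failed or a directory was missing.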

Limitations

One limitation at the moment is that rescribe is very sensitive to how page images are named. It will only work on pages named <anything>0001.png or <anything>0001.jpg, where 0001 is any four-digit number (and <anything> is anything!).
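Before a long run, it can be worth checking that every file in a book directory follows this naming rule. Here is a minimal sketch; the function name check_page_names is made up for this example, and it reads the rule literally as "ends in four digits followed by lowercase .png or .jpg", which is an assumption about how strict rescribe really is.

```shell
# Hypothetical helper, not part of rescribe itself.
# Usage: check_page_names <book directory>
# Prints each file whose name does not end in four digits plus .png or .jpg,
# and returns non-zero if any such file was found.
check_page_names() {
    dir="$1"
    bad=0
    for f in "$dir"/*; do
        # Skip subdirectories (and the unexpanded pattern in an empty directory).
        [ -f "$f" ] || continue
        name=$(basename "$f")
        case "$name" in
            *[0-9][0-9][0-9][0-9].png | *[0-9][0-9][0-9][0-9].jpg) ;;
            *)
                echo "unexpected name: $name"
                bad=1
                ;;
        esac
    done
    return "$bad"
}
```

Running check_page_names mybook prints nothing when every file matches. Note that it only reports problems and never renames anything, so it is safe to run on your originals.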

Source code

Rescribe is published under the GPLv3 license, and source code can be found by cloning the git repository at https://git.rescribe.xyz/bookpipeline.

More information

For more information on the inner workings of rescribe, take a look at our blog.