Rescribe

A desktop tool for OCR, using Tesseract and modern, efficient preprocessing and analysis pipelines. It is optimised for historical printed works, and includes models for a variety of historical scripts.

Download

rescribe 0.5.1 for Windows (2021-08-30)

rescribe 0.5.1 for Mac (2021-08-30)

rescribe 0.5.1 for Mac (M1) (2021-08-30)

rescribe 0.5.1 for Linux (2021-08-30)

Just click to download the tool, and save it to the folder where you want to do your OCR.

Usage

Rescribe is a command-line tool, but don't worry, it isn’t hard to use!

Start by opening a terminal window. On Windows, type cmd.exe into the Run box. On OSX, it’s under Applications → Utilities → Terminal. On Linux, I bet you already know where to find your terminal.

First, navigate to the folder where you saved the tool using the cd command, for example cd Documents/OCR.

If you’re on Linux or OSX you will probably need to make the program executable after downloading it, so do that now by running chmod +x rescribe. You’ll only have to do that once.

You use rescribe by giving it the name of the directory containing the book or manuscript pages you want to OCR. Basic usage looks like this:

./rescribe mybook


This will run rescribe over all pages in the directory mybook. A successful run will add several new files to mybook:

Rescribe has a set of OCR models built in, and it defaults to one trained specifically for historic printed Latin books. To see the other models available, run ./rescribe by itself and it will print the list. You can then choose an alternative model with the -t flag. For example, to use a model trained for Caroline Minuscule manuscripts, you would run:

./rescribe -t carolinemsv1_fast.traineddata mybook

If you have another model you would like to use, you can just put it in the same folder as rescribe and use its file name after -t.

Limitations

One limitation at the moment is that the rescribe tool is very sensitive to how page images are named. It will only work on pages named <anything>0001.png or <anything>0001.jpg, where 0001 is any four-digit number (and <anything> is anything!).
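
If your scans are named differently, a small shell script can help on Linux or OSX. The sketch below is not part of rescribe: the function names list_badly_named_pages and copy_pages_renumbered and the "page" prefix are made up for illustration, and it assumes the naming rule is exactly as described above (a name ending in four digits followed by .png or .jpg). It copies rather than renames, so the originals stay untouched. Pages are numbered in the shell's sorted filename order, so check that this matches the real page order first (scan_10.png sorts before scan_2.png, for example).

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical and not shipped with rescribe.
# Assumption: a page name is acceptable when it ends in four digits
# followed by .png or .jpg.

# Print every .png/.jpg file in directory $1 whose name does NOT end in
# four digits before the extension.
list_badly_named_pages() {
    for f in "$1"/*; do
        case "${f##*/}" in
            *[0-9][0-9][0-9][0-9].png | *[0-9][0-9][0-9][0-9].jpg) ;;  # name is fine
            *.png | *.jpg) printf '%s\n' "$f" ;;                        # needs renaming
        esac
    done
}

# Copy the .png/.jpg files in directory $1 into directory $2 as
# page0001.png, page0002.jpg, ... in sorted filename order.
# The original files are left as they are.
copy_pages_renumbered() {
    src=$1
    dest=$2
    n=0
    mkdir -p "$dest" || return 1
    for f in "$src"/*; do
        # Keep the original extension; skip anything that is not an image.
        case "$f" in
            *.png) ext=png ;;
            *.jpg) ext=jpg ;;
            *) continue ;;
        esac
        [ -f "$f" ] || continue
        n=$((n + 1))
        cp "$f" "$dest/$(printf 'page%04d.%s' "$n" "$ext")" || return 1
    done
}
```

After pasting these functions into your terminal, list_badly_named_pages mybook shows which files would be a problem, and copy_pages_renumbered mybook mybook_renamed makes a renumbered copy that you can then pass to ./rescribe mybook_renamed.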

Source code

Rescribe is published under the GPLv3 license, and the source code can be found by cloning its git repository with git clone https://git.rescribe.xyz/bookpipeline. It's written in Go, and is easy to hack on. If you have any patches or questions, please send them along to info@rescribe.xyz.

Release log

v0.5.1 (2021-08-30) - windows | mac | mac m1 | linux
Improved scaling of embedded images in PDFs to ensure they are always legible, and changed to always encode them as JPEG to save space.
v0.5.0 (2021-08-19) - windows | mac | mac m1 | linux
Added M1 (arm64) build, fixed PDF word coordinates, improved PDF copy-paste output, scaled hidden text to perfectly match image.
v0.4.0 (2021-07-09) - windows | mac | linux
Embedded prebuilt Tesseract and training data into the binary so no dependencies are needed, improved error messages and increased robustness.
v0.3.3 (2021-05-16) - windows | mac | linux
Changed default training directory to trainings/, fixed a rare crash that could happen when preprocessing some images.
v0.3.2 (2020-12-07) - windows | mac | linux
Allowed saving results to a different directory than the input directory, ensured that saving into a directory which already has output files works correctly, robustness improvements particularly on Windows.
v0.3.0 (2020-11-17) - windows | mac | linux
Initial public release of standalone tool.

More information

For more information on the inner workings of rescribe, take a look at our blog.