Rescribe

Rescribe is an easy-to-use desktop tool for performing OCR on image files, PDFs and Google Books. It uses the Tesseract OCR engine, combined with modern and efficient preprocessing and analysis pipelines, to produce high-quality output. The tool has been built with a focus on OCR of historical printed works, but it also includes modern language options and works well on modern printed works.

Download

rescribe 1.0.0 for Windows (2022-03-22)

rescribe 1.0.0 for Mac (2022-03-22)

rescribe 1.0.0 for Linux (2022-03-22)

Additional languages / scripts (optional)

Usage

Simply choose a source (a folder, PDF file or Google Book), select the appropriate language/script from the dropdown and hit “Start OCR”.

Mac users may have to authorize the app in their Security settings after downloading it.

Rescribe can also be used as a command-line tool if you are so inclined; see command line usage for more details.

Once complete, rescribe creates several new files for each book:

Thanks

This software builds on previous work for the ERC-sponsored research project “The Normalization of Natural Philosophy” (University of Groningen). Further work on v1.0.0, in particular the graphical interface, was funded by a number of wonderful and generous people on Kickstarter. A huge thank you to everyone, in particular:
Ivy Livingston, Rachel Gruber, davidak, Jennifer Dekker, Ray Berger, Zdeněk Pavlátka, Damon, Kyle Foley, May, Gregory Fuller, Molly Ceglowski, Christopher Lu, Jorge Cajiao, John Levin, Brandon W. Hawk, Phillip Staniczenko, Rexforr

Troubleshooting

There are some PDFs that Rescribe can't extract images from successfully. If this happens, you will need to extract the images into a folder yourself, and choose that folder in the tool to OCR. You can use an online PDF image extractor to do this if necessary.

There is a known bug where input files with spaces in the filename can sometimes cause the process to fail. If you're having trouble, try renaming files to remove any spaces, for example rename My Book.pdf to MyBook.pdf.
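
The renaming workaround above can be sketched as a small Python script for folders with many files. This is only an illustrative helper and not part of Rescribe: the function name strip_spaces is hypothetical, and skipping any file whose space-free name already exists is an assumption made to avoid overwriting anything.

```python
from pathlib import Path


def strip_spaces(folder: str) -> list[tuple[str, str]]:
    """Rename files in `folder` whose names contain spaces.

    Returns a list of (old_name, new_name) pairs for the files renamed.
    Hypothetical helper for illustration; not part of Rescribe.
    """
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        # Only touch regular files that actually have a space in the name.
        if not path.is_file() or " " not in path.name:
            continue
        target = path.with_name(path.name.replace(" ", ""))
        if target.exists():
            # Assumption: skip rather than overwrite an existing file.
            continue
        path.rename(target)
        renamed.append((path.name, target.name))
    return renamed


if __name__ == "__main__":
    import sys

    # Usage: python strip_spaces.py /path/to/folder
    for old, new in strip_spaces(sys.argv[1]):
        print(f"{old} -> {new}")
```

Run it on the folder containing your PDFs or page images before selecting that folder or file in Rescribe.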

If the app keeps failing for unintelligible reasons, please do get in touch at info@rescribe.xyz. If possible, include any messages posted to the “Log” section of the tool, to help us figure out what's up.

Source code

Rescribe is published under the GPLv3 license, and source code can be found by cloning its git repository with git clone https://git.rescribe.xyz/bookpipeline, or on GitHub. It's written in Go and is easy to hack on. If you have any patches or questions, please send them along to info@rescribe.xyz.

Release log

v1.0.0 (2022-03-22) - windows | mac | linux
Thanks to our fabulous Kickstarter backers, lots of improvements! Added GUI, added PDF extractor, added Google Book downloader, created a single binary for OSX for M1 & amd64, added file renamer so page files no longer need a particular naming format, added option to disable page wiping, added option to create full size PDF.
v0.5.3 (2021-10-18) - windows | mac | mac m1 | linux
Fixed an issue loading an external training file.
v0.5.2 (2021-10-01) - windows | mac | mac m1 | linux
Added embedded Latin training for modern printed works.
v0.5.1 (2021-08-30) - windows | mac | mac m1 | linux
Improved scaling of embedded images in PDFs to ensure they are always legible, and changed to always encode them as JPEG to save space.
v0.5.0 (2021-08-19) - windows | mac | mac m1 | linux
Added M1 (arm64) build, fixed PDF word coordinates, improved PDF copy-paste output, scaled hidden text to perfectly match image.
v0.4.0 (2021-07-09) - windows | mac | linux
Embedded prebuilt Tesseract and training data into the binary so no dependencies are needed, improved error messages and increased robustness.
v0.3.3 (2021-05-16) - windows | mac | linux
Changed default training directory to trainings/, fixed a rare crash that could happen when preprocessing some images.
v0.3.2 (2020-12-07) - windows | mac | linux
Allowed saving results to a different directory than the input directory, ensured that saving into a directory which already has output files works correctly, robustness improvements particularly on Windows.
v0.3.0 (2020-11-17) - windows | mac | linux
Initial public release of standalone tool.

More information

For more information on the inner workings of rescribe, take a look at our blog.