Rescribe is an easy-to-use desktop tool for performing OCR on image files, PDFs and Google Books. It uses the Tesseract OCR engine, combined with modern and efficient preprocessing and analysis pipelines, to produce high quality output. The tool has been built with a focus on OCR of historical printed works, but it includes modern language options and also works well on modern printed works.
rescribe 1.0.0 for Windows (2022-03-22)
rescribe 1.0.0 for Mac (2022-03-22)
rescribe 1.0.0 for Linux (2022-03-22)
Additional languages / scripts (optional)
Simply choose a source – a folder, PDF file or Google Book –, select the appropriate language/script from the dropdown and hit “Start OCR”.
Mac Users may have to authorize the app for use in their Security settings after downloading the app.
Rescribe can also be used as a command-line tool, if you are so inclined, see command line usage for more details.
Once complete, rescribe creates several new files for each book:
mybook searchable.pdf
), which is fully searchable.text
directory, containing plain text versions of the OCR results for each page.hocr
directory, containing hOCR formatted OCR results for each page.graph.png
file, which shows the OCR confidence of each page (a rough indicator of the quality of the OCR over the book).conf
file, which lists the OCR confidence of each page, at each preprocessing binarisation threshold attempted.This software builds on previous work for the ERC-sponsored research project “The Normalization of Natural Philosophy” (University of Groningen). Further work on v1.0.0, in particular the graphical interface, was funded by a number of wonderful and generous people on Kickstarter. A huge thank you to everyone, in particular:
Ivy Livingston, Rachel Gruber, davidak, Jennifer Dekker, Ray Berger, Zdeněk Pavlátka, Damon, Kyle Foley, May, Gregory Fuller, Molly Ceglowski, Christopher Lu, Jorge Cajiao, John Levin, Brandon W. Hawk, Phillip Staniczenko, Rexforr
There are some PDFs that Rescribe can't extract images from successfully. If this happens, you will need to extract the images into a folder yourself, and choose that folder in the tool to OCR. You can use an online PDF image extractor to do this if necessary.
There is a known bug where input files with spaces in the filename can sometimes cause the process to fail. If you're having trouble, try renaming files to remove any spaces, for example rename My Book.pdf
to MyBook.pdf
.
If the app keeps failing for unintelligible reasons, please do get in touch at info@rescribe.xyz. If possible, include any messages posted to the “Log” section of the tool, to help us figure out what's up.
Rescribe is published under the GPLv3 license, and source code can be found by cloning its git repository with git clone https://git.rescribe.xyz/bookpipeline
, or on github. It's written in Go, and is easy to hack on, if you have any patches or questions, please send them along to info@rescribe.xyz.
For more information on the inner workings of rescribe, take a look at our blog.