Rescribe

Rescribe is an easy-to-use desktop tool for performing OCR on image files, PDFs and Google Books. It uses the Tesseract OCR engine, combined with modern and efficient preprocessing and analysis pipelines, to produce high-quality output. The tool has been built with a focus on OCR of historical printed works, but it includes modern language options and also works well on modern printed works.

Download

rescribe 1.3.0 for Windows (2024-04-10)

rescribe 1.3.0 for Mac (2024-04-10)

rescribe 1.3.0 for Linux (alternative builds: wayland | flathub) (2024-04-10)

Additional languages / scripts (optional)

Usage

Simply choose a source (a folder, PDF file or Google Book), select the appropriate language/script from the dropdown and hit “Start OCR”.

Mac users may have to authorize the app in their Security settings after downloading it. See Troubleshooting below if you have difficulties.

Rescribe can also be used as a command-line tool, if you are so inclined. See command line usage for more details.

Once OCR is complete, Rescribe creates several new files for each book:

A PDF file named after the directory (e.g. mybook searchable.pdf), which is fully searchable.
A text file named after the directory (e.g. mybook.txt), which contains the full text from all pages in one file.
A text directory, containing plain text versions of the OCR results for each page.
A hocr directory, containing hOCR formatted OCR results for each page.
A graph.png file, which shows the OCR confidence of each page (a rough indicator of the quality of the OCR over the book).
A conf file, which lists the OCR confidence of each page, at each preprocessing binarisation threshold attempted.
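
To get a feel for what the per-page confidence figures are built from, the hOCR files can be inspected directly. Tesseract's hOCR output records a confidence for each recognised word as an x_wconf property in the word's title attribute. The Go sketch below averages those values for each hOCR file given on the command line. It is an illustration only: the function name meanWordConfidence is made up for this example, it assumes the files in the hocr directory carry x_wconf values in the usual Tesseract form, and it is not necessarily how Rescribe computes the numbers in graph.png or the conf file.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// wconfRe matches the x_wconf property that Tesseract writes into the
// title attribute of each ocrx_word element, e.g. "x_wconf 93".
var wconfRe = regexp.MustCompile(`x_wconf (\d+)`)

// meanWordConfidence (a hypothetical helper, not part of Rescribe) returns
// the average x_wconf value found in the hOCR text and the number of words
// counted. It returns 0, 0 if no confidences are present.
func meanWordConfidence(hocr string) (float64, int) {
	total, count := 0, 0
	for _, m := range wconfRe.FindAllStringSubmatch(hocr, -1) {
		n, err := strconv.Atoi(m[1])
		if err != nil {
			continue // skip values too large to be a sensible confidence
		}
		total += n
		count++
	}
	if count == 0 {
		return 0, 0
	}
	return float64(total) / float64(count), count
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: hocrconf <hocr file>...")
		os.Exit(1)
	}
	// Print one line per hOCR file: path, mean confidence, word count.
	for _, path := range os.Args[1:] {
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		mean, words := meanWordConfidence(string(data))
		fmt.Printf("%s: mean word confidence %.1f over %d words\n", path, mean, words)
	}
}
```

Pages with a noticeably lower mean than their neighbours are good candidates for a manual check.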

Thanks

This software builds on previous work for the ERC-sponsored research project “The Normalization of Natural Philosophy” (University of Groningen). Further work on v1.0.0, in particular the graphical interface, was funded by a number of wonderful and generous people on Kickstarter. A huge thank you to everyone, in particular:
Ivy Livingston, Rachel Gruber, davidak, Jennifer Dekker, Ray Berger, Zdeněk Pavlátka, Damon, Kyle Foley, May, Gregory Fuller, Molly Ceglowski, Christopher Lu, Jorge Cajiao, John Levin, Brandon W. Hawk, Phillip Staniczenko, Rexforr

Troubleshooting

There are some PDFs that Rescribe can't extract images from successfully. If this happens, you will need to extract the images into a folder yourself, and choose that folder in the tool to OCR. You can use an online PDF image extractor to do this if necessary.

When you try to open Rescribe on macOS for the first time, you may get a pop-up with a message like this: "Rescribe" can't be opened because Apple cannot check it for malicious software. This is because we haven't paid Apple to be part of their developer programme. You can bypass it by holding the Control key and clicking on the app icon, then clicking Open. The warning message that then pops up will have an "Open" button. Click it to open Rescribe. You will only have to do this once; after that it will open as normal.

There was a known bug (fixed in v1.2.0) where input files with spaces in the filename could sometimes cause the process to fail. If you're having trouble, either update to v1.2.0 or later, or try renaming the files to remove any spaces, for example rename Page 1.jpg to Page1.jpg.
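
If a folder has many pages, renaming them by hand gets tedious. The following is a minimal Go sketch of the renaming workaround described above. It is not part of Rescribe; the program name nospaces and the helper functions stripSpaces and renameWithoutSpaces are made up for this example. It removes spaces from the names of files directly inside one folder and stops rather than overwrite a file that already exists.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// stripSpaces returns name with all space characters removed,
// e.g. "Page 1.jpg" becomes "Page1.jpg".
func stripSpaces(name string) string {
	return strings.ReplaceAll(name, " ", "")
}

// renameWithoutSpaces renames every non-directory entry directly inside dir
// whose name contains a space. It refuses to overwrite an existing file and
// returns the number of files renamed.
func renameWithoutSpaces(dir string) (int, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	renamed := 0
	for _, e := range entries {
		if e.IsDir() || !strings.Contains(e.Name(), " ") {
			continue
		}
		oldPath := filepath.Join(dir, e.Name())
		newPath := filepath.Join(dir, stripSpaces(e.Name()))
		// Stop if the target name is already taken, so nothing is lost.
		if _, err := os.Stat(newPath); err == nil {
			return renamed, fmt.Errorf("not renaming %s: %s already exists", oldPath, newPath)
		}
		if err := os.Rename(oldPath, newPath); err != nil {
			return renamed, err
		}
		renamed++
	}
	return renamed, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: nospaces <folder>")
		os.Exit(1)
	}
	n, err := renameWithoutSpaces(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("renamed %d file(s)\n", n)
}
```

Run it on a copy of the folder first if the page images are hard to replace.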

If the app keeps failing for unintelligible reasons, please do get in touch at info@rescribe.xyz. If possible, include any messages posted to the “Log” section of the tool, to help us figure out what's up.

Source code

Rescribe is published under the GPLv3 license. The source code can be obtained by cloning its git repository with git clone https://git.rescribe.xyz/bookpipeline, or found on GitHub. It's written in Go and is easy to hack on. If you have any patches or questions, please send them along to info@rescribe.xyz.

Release log

v1.3.0 (2024-04-10) - windows | mac | linux | linux (wayland): Added support for PDFs with embedded rotated images, created new icon, added version number to MacOS and Windows builds, fixed possible crash on Windows caused by temporary jpegs not being closed before removal, fixed issue where an invalid PDF could be created in some cases, fixed issue with flatpak where not all trainings were available.
v1.2.0 (2024-02-16) - windows | mac | linux | linux (wayland): Fixed bug with directories containing files with spaces causing the process to fail, added concatenated text output named bookname.txt, fixed selecting a custom training in flatpak build, fixed getgbook on arm64 MacOS, improved layout of log area to fill all available space in the window, improved readability of log area text.
v1.1.0 (2023-02-14) - windows | mac | linux | linux (wayland): Improved PDF reading by adding support for embedded CCITT images. Improved PDF parsing to prevent a possible crash with bad PDF files. Improved error messages for unreadable PDFs. Improved GUI theme thanks to an update to Fyne.
v1.0.0 (2022-03-22) - windows | mac | linux: Thanks to our fabulous Kickstarter backers, lots of improvements! Added GUI, added PDF extractor, added Google Book downloader, created a single binary for OSX for M1 & amd64, added file renamer so page files no longer need a particular naming format, added option to disable page wiping, added option to create full size PDF.
v0.5.3 (2021-10-18) - windows | mac | mac m1 | linux: Fix issue loading external training file.
v0.5.2 (2021-10-01) - windows | mac | mac m1 | linux: Added embedded Latin training for modern printed works.
v0.5.1 (2021-08-30) - windows | mac | mac m1 | linux: Improved scaling of embedded images in PDFs to ensure they are always legible, and changed to always encode them as JPEG to save space.
v0.5.0 (2021-08-19) - windows | mac | mac m1 | linux: Added M1 (arm64) build, fixed PDF word coordinates, improved PDF copy-paste output, scaled hidden text to perfectly match image.
v0.4.0 (2021-07-09) - windows | mac | linux: Embedded prebuilt Tesseract and training data into the binary so no dependencies are needed, improved error messages and increased robustness.
v0.3.3 (2021-05-16) - windows | mac | linux: Changed default training directory to trainings/, fixed a rare crash that could happen when preprocessing some images.
v0.3.2 (2020-12-07) - windows | mac | linux: Allowed saving results to a different directory than the input directory, ensured that saving into a directory which already has output files works correctly, robustness improvements particularly on Windows.
v0.3.0 (2020-11-17) - windows | mac | linux: Initial public release of standalone tool.

More information

For more information on the inner workings of rescribe, take a look at our blog.