Languages and scripts for Rescribe OCR

The Rescribe OCR tool comes with several languages and scripts built in, and many more are freely available to download and use with it.

To use one of the files below, download it to your computer, then click on the "Other..." option in the "Language / Script" box in Rescribe and select the file you downloaded.
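If you have downloaded several files and want to check which ones are available to select, a small script can list them. The following is a minimal sketch, not part of Rescribe: the function name find_training_files is hypothetical, and it assumes the downloaded files keep the ".traineddata" extension that Tesseract training files normally use.

```python
from pathlib import Path


def find_training_files(folder):
    """Return the .traineddata files directly inside `folder`, sorted by name.

    Assumption: downloaded training files keep the usual Tesseract
    ".traineddata" extension. Subfolders are not searched.
    """
    folder = Path(folder)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix == ".traineddata"
    )
```

For example, calling find_training_files(Path.home() / "Downloads") would list the training files in a typical downloads folder (the folder location is also an assumption and varies between systems). Any file it returns can then be chosen through the "Other..." option.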

Beta models

These are models we have created that have not yet had enough testing, or have not been trained on a wide enough sample of ground truth, to be released with the Rescribe tool.

Caroline minuscule (beta version)
Welsh (printed ca. 1500-1800)

Tesseract models

These are all training files produced by the Tesseract OCR project, whose OCR engine Rescribe uses. Any other modern Tesseract training files you find online should also work with Rescribe.

Afrikaans
Amharic
Albanian
Arabic
Armenian
Assamese
Azeri
Azeri, Cyrillic
Basque
Belarusian
Bengali
Bosnian
Breton
Bulgarian
Burmese
Catalan
Cebuano
Central Khmer, Khmer
Central Tibetan
Cherokee
Chinese, Simplified
Chinese, Simplified vertical
Chinese, Traditional
Chinese, Traditional vertical
Corsican
Croatian
Czech
Danish
German
Dhivehi, Maldivian
Dutch, Flemish
Dzongkha
Greek, Modern
English, Modern
English, Middle
Esperanto
Estonian
Faroese
Filipino
Finnish
Frankish
French, Middle
French, Modern
Frisian, Western
Gaelic
Galician
Greek, Ancient
Gujarati
Haitian, Haitian Creole
Hebrew
Hindi
Hungarian
Icelandic
Inuktitut
Indonesian
Irish
Italian
Japanese
Javanese
Japanese vertical
Kannada
Georgian
Kazakh
Kirghiz, Kyrgyz
Korean
Korean vertical
Kurdish, Northern
Lao
Latin
Latvian
Lithuanian
Letzeburgesch, Luxembourgish
Macedonian
Malayalam
Maltese
Mongolian
Maori
Malay
Marathi
Moldavian, Moldovan, Romanian
Nepali
Norwegian
Occitan (post 1500)
Oriya
Panjabi, Punjabi
Persian
Polish
Portuguese
Pashto/Pushto
Quechua
Russian
Sanskrit
Sindhi
Sinhala, Sinhalese
Slovak
Slovenian
Spanish
Serbian
Serbian Latin script
Sundanese
Swahili
Swedish
Syriac
Tajik
Tamil
Tatar
Telugu
Thai
Tigrinya
Tonga
Turkish
Uighur
Ukrainian
Urdu
Uzbek
Uzbek, Cyrillic
Vietnamese
Welsh
Yiddish
Yoruba

License

These training files are all released by the Tesseract OCR project under the Apache License, Version 2.0.