Revision of Optical Character Recognition With Tesseract OCR On Ubuntu 7.04 from Wed, 08/29/2007 - 08:38

This document describes how to set up Tesseract OCR on Ubuntu 7.04. OCR stands for "Optical Character Recognition". The resulting system can convert images that contain text into plain text files. Tesseract is licensed under the Apache License v2.0.

1 Preparation

Set up a basic Ubuntu 7.04 system and update it. Get scanned images or scan documents yourself.
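The update step can be sketched as a small shell function. This is only a minimal sketch: it assumes the default apt configuration and a user with sudo rights, and the function name update_system is illustrative rather than part of the original guide.

```shell
# Hypothetical helper (the name is an assumption): bring the system up to date.
update_system() {
    # Refresh the package lists from the configured repositories.
    sudo apt-get update &&
    # Install the newest versions of all currently installed packages.
    sudo apt-get upgrade
}
```

Call update_system once on the freshly installed system; apt-get will ask for confirmation before it installs the upgrades.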

If you use a scanner, make sure that it is supported by SANE. A list of supported devices is available at

2 Get ImageMagick