<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.8.1: http://docutils.sourceforge.net/" />
<title>OCR toolkit for Gamera</title>
<link rel="stylesheet" href="default.css" type="text/css" />
</head>
<body>
<div class="document" id="ocr-toolkit-for-gamera">
<h1 class="title">OCR toolkit for Gamera</h1>
<p><strong>Last modified</strong>: February 13, 2012</p>
<div class="contents topic" id="contents">
<p class="topic-title first">Contents</p>
<ul class="simple">
<li><a class="reference internal" href="#overview" id="id7">Overview</a><ul>
<li><a class="reference internal" href="#the-recognition-process" id="id8">The recognition process</a></li>
<li><a class="reference internal" href="#provided-components" id="id9">Provided Components</a></li>
<li><a class="reference internal" href="#limitations" id="id10">Limitations</a></li>
</ul>
</li>
<li><a class="reference internal" href="#user-s-manual" id="id11">User's Manual</a></li>
<li><a class="reference internal" href="#developer-s-manual" id="id12">Developer's Manual</a></li>
<li><a class="reference internal" href="#installation" id="id13">Installation</a><ul>
<li><a class="reference internal" href="#prerequisites" id="id14">Prerequisites</a></li>
<li><a class="reference internal" href="#building-and-installing" id="id15">Building and Installing</a></li>
<li><a class="reference internal" href="#installing-without-root-privileges" id="id16">Installing without root privileges</a></li>
<li><a class="reference internal" href="#uninstallation" id="id17">Uninstallation</a><ul>
<li><a class="reference internal" href="#python-library-files" id="id18">Python Library Files</a></li>
<li><a class="reference internal" href="#standalone-scripts" id="id19">Standalone Scripts</a></li>
</ul>
</li>
</ul>
</li>
<li><a class="reference internal" href="#about-this-documentation" id="id20">About this documentation</a></li>
</ul>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field"><th class="field-name">Editor:</th><td class="field-body">Rene Baston, Christoph Dalitz</td>
</tr>
<tr class="field"><th class="field-name">Version:</th><td class="field-body">1.0.6</td>
</tr>
</tbody>
</table>
<p>Use the 'Addons' section on the <a class="reference external" href="http://gamera.informatik.hsnr.de/addons/">Gamera home page</a> for access to file
releases of this toolkit.</p>
<div class="section" id="overview">
<h1><a class="toc-backref" href="#id7">Overview</a></h1>
<p>The purpose of the <em>OCR Toolkit</em> is to help build optical character
recognition (OCR) systems for standard text documents. Even though it
can be used as is, it is specifically designed to make individual
steps of the recognition system customizable and replaceable.
The toolkit is based on and requires the <a class="reference external" href="http://gamera.sf.net/">Gamera framework</a> for document
analysis and recognition. As an addon package for Gamera, it provides</p>
<ul class="simple">
<li>Python library functions for building a custom OCR system</li>
<li>a ready-to-run Python script <tt class="docutils literal">ocr4gamera</tt> which acts as a basic OCR system</li>
</ul>
<p>A comprehensive overview of design, usage and customization of the OCR toolkit
can be found in the paper</p>
<blockquote>
C. Dalitz, R. Baston: <a class="reference external" href="http://lionel.kr.hsnr.de/~dalitz/data/publications/sr09-ocr-gamera.pdf">Optical Character Recognition with the
Gamera Framework.</a> In C. Dalitz (Ed.): "Document Image Analysis with
the Gamera Framework." Schriftenreihe des Fachbereichs Elektrotechnik und
Informatik, Hochschule Niederrhein, vol. 8, pp. 53-65, Shaker Verlag (2009)</blockquote>
<div class="section" id="the-recognition-process">
<h2><a class="toc-backref" href="#id8">The recognition process</a></h2>
<p><em>Optical character recognition</em> (OCR) is the extraction of
machine-readable text from bitmap images of text documents.
This process typically consists of the following steps:</p>
<dl class="docutils">
<dt><strong>Preprocessing:</strong></dt>
<dd>Includes binarization, skew correction, image enhancement,
text/graphics separation</dd>
<dt><strong>Segmentation:</strong></dt>
<dd>Segmentation of the page into text lines (page segmentation) and
characters (character segmentation)</dd>
<dt><strong>Classification:</strong></dt>
<dd>Identification of the individual characters</dd>
<dt><strong>Postprocessing:</strong></dt>
<dd>Includes the generation of the output string and, optionally, the
detection and correction of possible errors</dd>
</dl>
<p>The OCR toolkit only covers the process from segmentation to postprocessing.
For preprocessing, the standard routines shipped with Gamera must be used
beforehand, e.g. <em>rotation_angle_projections</em> for skew correction, or
<em>despeckle</em> for noise removal.</p>
<p>For classification, the kNN classifier shipped with Gamera must be used.
This means, in particular, that you must train the classifier on some sample
pages before doing the classification. At present, the toolkit does not
include training databases for common fonts.</p>
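<p>To illustrate why training samples are required, here is a minimal sketch
of nearest-neighbour classification in plain Python. It is not the Gamera kNN
classifier and does not use its API; the function <tt class="docutils literal">knn_classify</tt>, the two
features and the training values are hypothetical and chosen only for
illustration.</p>
<pre class="literal-block">
from __future__ import print_function
from collections import Counter

def knn_classify(sample, training, k=1):
    """Return the majority label among the k training items nearest to sample."""
    # squared Euclidean distance between two feature vectors
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(training, key=lambda item: dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# hypothetical training data: (aspect ratio, relative black area) per glyph
training = [
    ((0.30, 0.80), "l"),
    ((0.35, 0.75), "l"),
    ((1.00, 0.45), "o"),
    ((0.95, 0.50), "o"),
]

print(knn_classify((0.32, 0.78), training, k=3))  # prints l
print(knn_classify((0.98, 0.47), training, k=3))  # prints o
</pre>
<p>A classifier of this kind can only return labels that occur in its training
data, which is why sample pages with the fonts in question must be trained
first.</p>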
</div>
<div class="section" id="provided-components">
<h2><a class="toc-backref" href="#id9">Provided Components</a></h2>
<p>The toolkit consists of two Python modules, one image plugin function, and
one end-user application.</p>
<p>The modules are</p>
<ul class="simple">
<li><em>classes</em>, which contains all class definitions</li>
<li><em>ocr_toolkit</em>, which contains global functions used across the classes</li>
</ul>
<p>The end user application is</p>
<ul class="simple">
<li><em>ocr4gamera.py</em>, a script that acts as a basic OCR system</li>
</ul>
<p>There is also one image plugin <em>bbox_seg</em> for textline segmentation
which is simply a wrapper around the Gamera core plugin <tt class="docutils literal">bbox_segmentation</tt>.</p>
</div>
<div class="section" id="limitations">
<h2><a class="toc-backref" href="#id10">Limitations</a></h2>
<p>As the segmentation of the individual characters is based on a connected
component analysis, the toolkit cannot deal with touching characters unless
they have been trained as ligatures. It is therefore generally
applicable only to printed documents, not to handwritten documents.</p>
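<p>The effect of touching characters on a connected component analysis can be
sketched in plain Python. This is an illustration only, not the Gamera
implementation; the function <tt class="docutils literal">count_components</tt> is hypothetical, and
8-connectivity of black pixels is an assumption made for this example.</p>
<pre class="literal-block">
from __future__ import print_function

def count_components(rows):
    """Count 8-connected components of '#' pixels in a list of strings."""
    pixels = set((r, c) for r, row in enumerate(rows)
                 for c, ch in enumerate(row) if ch == "#")
    count = 0
    while pixels:
        # flood fill from an arbitrary unvisited black pixel
        stack = [pixels.pop()]
        while stack:
            r, c = stack.pop()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbour = (r + dr, c + dc)
                    if neighbour in pixels:
                        pixels.remove(neighbour)
                        stack.append(neighbour)
        count += 1
    return count

# two vertical strokes separated by a white column
separate = ["#.#",
            "#.#",
            "#.#"]
# the same strokes joined by a horizontal bar
touching = ["#.#",
            "###",
            "#.#"]

print(count_components(separate))  # prints 2
print(count_components(touching))  # prints 1
</pre>
<p>Once two strokes touch, they form a single component, so a segmentation
based on connected components hands them to the classifier as one glyph.</p>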
<p>From a user's perspective, there are some points to be aware of in this toolkit:</p>
<ul class="simple">
<li>It does not provide methods for text/graphics separation. Hopefully,
some generic methods for this purpose will be added at some point in
the Gamera core.</li>
<li>It does not provide prototypes of Latin characters. This means that
characters must be trained on sample pages before using the toolkit.</li>
<li>The standard page segmentation algorithm for textline separation
is currently very basic.</li>
</ul>
</div>
</div>
<div class="section" id="user-s-manual">
<h1><a class="toc-backref" href="#id11">User's Manual</a></h1>
<p>This documentation is written for those who want to use the toolkit
for OCR, but are not interested in extending the toolkit itself.</p>
<ul class="simple">
<li><a class="reference external" href="usermanual.html">Using the toolkit</a>: explains how to use the toolkit.</li>
</ul>
</div>
<div class="section" id="developer-s-manual">
<h1><a class="toc-backref" href="#id12">Developer's Manual</a></h1>
<p>This documentation is for those who want to extend the functionality of
the OCR toolkit, or who want to customize specific steps of the
recognition process.</p>
<ul class="simple">
<li><a class="reference external" href="developermanual.html">Developer's manual</a>: describes how to customize the recognition process</li>
<li><a class="reference external" href="classes.html">Classes</a>: reference for the classes involved in the segmentation process.
These are:<ul>
<li><a class="reference external" href="gamera.toolkits.ocr.classes.Page.html">Page</a> for doing the page segmentation</li>
<li><a class="reference external" href="gamera.toolkits.ocr.classes.Textline.html">Textline</a> for storing the segmentation result within <tt class="docutils literal">Page</tt></li>
<li><a class="reference external" href="gamera.toolkits.ocr.classes.ClassifyCCs.html">ClassifyCCs</a> for (optionally) doing the classification during page
segmentation</li>
</ul>
</li>
<li><a class="reference external" href="functions.html">Functions</a>: the global functions defined by the toolkit</li>
<li><a class="reference external" href="plugins.html">Plugins</a>: reference for the plugin functions shipped with this toolkit</li>
</ul>
</div>
<div class="section" id="installation">
<h1><a class="toc-backref" href="#id13">Installation</a></h1>
<p>We have only tested the toolkit on Linux and MacOS X, but as
the toolkit is written entirely in Python, the following
instructions should work for any operating system.</p>
<div class="section" id="prerequisites">
<h2><a class="toc-backref" href="#id14">Prerequisites</a></h2>
<p>First you will need a working installation of Gamera 3.x. See the
<a class="reference external" href="http://gamera.sourceforge.net/">Gamera website</a> for details. It is strongly recommended that you use
a recent version, preferably from SVN.</p>
<p>If you want to generate the documentation, you will need two additional
third-party Python libraries:</p>
<blockquote>
<ul class="simple">
<li><a class="reference external" href="http://docutils.sourceforge.net/">docutils</a> for handling reStructuredText documents.</li>
<li><a class="reference external" href="http://pygments.org/">pygments</a> for colorizing source code.</li>
</ul>
</blockquote>
<div class="note">
<p class="first admonition-title">Note</p>
<p class="last">It is generally not necessary to generate the documentation
because it is included in file releases of the toolkit.</p>
</div>
</div>
<div class="section" id="building-and-installing">
<h2><a class="toc-backref" href="#id15">Building and Installing</a></h2>
<p>To build and install this toolkit, go to the base directory of the toolkit
distribution and run the <tt class="docutils literal">setup.py</tt> script as follows:</p>
<pre class="literal-block">
# 1) compile
python setup.py build
# 2) install
sudo python setup.py install
</pre>
<p>Command 1) compiles the toolkit from the sources and command 2) installs
it. As the latter requires
root privileges, you need to use <tt class="docutils literal">sudo</tt> on Linux and MacOS X. On Windows,
<tt class="docutils literal">sudo</tt> is not necessary.</p>
<p>Note that the script <em>ocr4gamera</em> is installed into <tt class="docutils literal">/usr/bin</tt> on Linux,
but into <tt class="docutils literal">/System/Library/Frameworks/Python.framework/Versions/2.x/bin</tt>
on MacOS X. As the latter directory is not in the standard search path,
you can either add it to your search path, or additionally install the
scripts into <tt class="docutils literal">/usr/bin</tt> on MacOS X with:</p>
<pre class="literal-block">
# install scripts into standard path (MacOS X only)
sudo python setup.py install_scripts -d /usr/bin
</pre>
<p>If you want to regenerate the documentation, go to the <tt class="docutils literal">doc</tt> directory
and run the <tt class="docutils literal">gendoc.py</tt> script. The output will be placed in the <tt class="docutils literal">doc/html/</tt>
directory. The contents of this directory can be placed on a webserver
for convenient viewing.</p>
<div class="note">
<p class="first admonition-title">Note</p>
<p class="last">Before building the documentation you must install the
toolkit. Otherwise <tt class="docutils literal">gendoc.py</tt> will not find the plugin documentation.</p>
</div>
</div>
<div class="section" id="installing-without-root-privileges">
<h2><a class="toc-backref" href="#id16">Installing without root privileges</a></h2>
<p>The above installation with <tt class="docutils literal">python setup.py install</tt> will install
the toolkit system-wide and thus requires root privileges. If you do
not have root access (Linux) or are not a sudoer (MacOS X), you can
install the OCR toolkit into your home directory. Note, however,
that this also requires that Gamera is installed into your home directory.
It is currently not possible to install Gamera globally and only toolkits
locally.</p>
<p>Here are the steps to install both Gamera and the OCR toolkit into
<tt class="docutils literal">~/python</tt>:</p>
<pre class="literal-block">
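# install Gamera locally (run in the Gamera source directory)
mkdir ~/python
python setup.py install --prefix=~/python
# build and install the OCR toolkit locally (run in the toolkit source directory)
# $HOME is used because the shell does not expand "~" directly after "-I";
# replace "python2.3" with your actual Python version
export CFLAGS=-I$HOME/python/include/python2.3/gamera
python setup.py build
python setup.py install --prefix=~/python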
</pre>
<p>Moreover you should set the following environment variables in your
<tt class="docutils literal"><span class="pre">~/.profile</span></tt>:</p>
<pre class="literal-block">
# search path for python modules
export PYTHONPATH=~/python/lib/python
# search path for executables (e.g. gamera_gui)
export PATH=~/python/bin:$PATH
</pre>
</div>
<div class="section" id="uninstallation">
<h2><a class="toc-backref" href="#id17">Uninstallation</a></h2>
<p>The installation uses the Python <em>distutils</em>, which does not support
uninstallation. Thus you need to remove the installed files manually:</p>
<ul class="simple">
<li>the installed Python library files of the toolkit</li>
<li>the installed standalone scripts</li>
</ul>
<div class="section" id="python-library-files">
<h3><a class="toc-backref" href="#id18">Python Library Files</a></h3>
<p>All Python library files of this toolkit are installed into the
<tt class="docutils literal">gamera/toolkits/ocr</tt> subdirectory of the Python library folder.
Thus it is sufficient to remove this directory for an uninstallation.</p>
<p>The location of the Python library folder depends on your system and Python version.
Here are the folders that you need to remove on MacOS X and Debian Linux
("2.3" stands for the Python version; replace it with your actual version):</p>
<blockquote>
<ul class="simple">
<li>MacOS X: <tt class="docutils literal">/Library/Python/2.3/gamera/toolkits/ocr</tt></li>
<li>Debian Linux: <tt class="docutils literal"><span class="pre">/usr/lib/python2.3/site-packages/gamera/toolkits/ocr</span></tt></li>
</ul>
</blockquote>
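<p>If you are unsure where the library folder is on your system, the
following sketch prints candidate locations. It assumes a Python interpreter
that provides the standard <tt class="docutils literal">sysconfig</tt> module (Python 2.7 or later) and
that the toolkit was installed for this interpreter without a custom
<tt class="docutils literal"><span class="pre">--prefix</span></tt> or <tt class="docutils literal"><span class="pre">--home</span></tt>. The helper <tt class="docutils literal">toolkit_dirs</tt> is hypothetical
and not part of the toolkit. The script only reports directories and does not
delete anything.</p>
<pre class="literal-block">
from __future__ import print_function
import os
import sysconfig

def toolkit_dirs():
    """Return candidate install directories of the OCR toolkit, without duplicates."""
    paths = sysconfig.get_paths()
    candidates = []
    # pure Python and platform specific library folders may differ
    for key in ("purelib", "platlib"):
        path = os.path.join(paths[key], "gamera", "toolkits", "ocr")
        if path not in candidates:
            candidates.append(path)
    return candidates

for path in toolkit_dirs():
    state = "found" if os.path.isdir(path) else "not present"
    print(path, "-", state)
</pre>
<p>Run the script with the same Python interpreter that was used for the
installation. If no directory is reported as found, the toolkit was probably
installed under a different prefix.</p>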
</div>
<div class="section" id="standalone-scripts">
<h3><a class="toc-backref" href="#id19">Standalone Scripts</a></h3>
<p>The standalone scripts are installed into <tt class="docutils literal">/usr/bin</tt> (Linux) or
<tt class="docutils literal">/System/Library/Frameworks/Python.framework/Versions/2.3/bin</tt> (MacOS X),
unless you have explicitly chosen a different location with the options
<tt class="docutils literal"><span class="pre">--prefix</span></tt> or <tt class="docutils literal"><span class="pre">--home</span></tt> during installation.</p>
<p>For an uninstall, remove the following script:</p>
<blockquote>
<ul class="simple">
<li><tt class="docutils literal">ocr4gamera.py</tt></li>
</ul>
</blockquote>
<div class="note">
<p class="first admonition-title">Note</p>
<p class="last">In older versions (1.0.0 and 1.0.1) this script was named
<tt class="docutils literal">ocr4gamera</tt>. Remove this old script if you are upgrading from
one of these versions.</p>
</div>
</div>
</div>
</div>
<div class="section" id="about-this-documentation">
<h1><a class="toc-backref" href="#id20">About this documentation</a></h1>
<p>The documentation was written by Rene Baston and Christoph Dalitz.
Permission is granted to copy, distribute and/or modify this documentation
under the terms of the <a class="reference external" href="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution Share-Alike License (CC-BY-SA) v3.0</a>. In addition, permission is granted to use and/or
modify the code snippets from the documentation without restrictions.</p>
</div>
</div>
</body>
</html>