<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.8.1: http://docutils.sourceforge.net/" />
<title>OCR toolkit for Gamera</title>
<link rel="stylesheet" href="default.css" type="text/css" />
</head>
<body>
<div class="document" id="ocr-toolkit-for-gamera">
<h1 class="title">OCR toolkit for Gamera</h1>

<p><strong>Last modified</strong>: February 13, 2012</p>
<div class="contents topic" id="contents">
<p class="topic-title first">Contents</p>
<ul class="simple">
<li><a class="reference internal" href="#overview" id="id7">Overview</a><ul>
<li><a class="reference internal" href="#the-recognition-process" id="id8">The recognition process</a></li>
<li><a class="reference internal" href="#provided-components" id="id9">Provided Components</a></li>
<li><a class="reference internal" href="#limitations" id="id10">Limitations</a></li>
</ul>
</li>
<li><a class="reference internal" href="#user-s-manual" id="id11">User's Manual</a></li>
<li><a class="reference internal" href="#developer-s-manual" id="id12">Developer's Manual</a></li>
<li><a class="reference internal" href="#installation" id="id13">Installation</a><ul>
<li><a class="reference internal" href="#prerequisites" id="id14">Prerequisites</a></li>
<li><a class="reference internal" href="#building-and-installing" id="id15">Building and Installing</a></li>
<li><a class="reference internal" href="#installing-without-root-privileges" id="id16">Installing without root privileges</a></li>
<li><a class="reference internal" href="#uninstallation" id="id17">Uninstallation</a><ul>
<li><a class="reference internal" href="#python-library-files" id="id18">Python Library Files</a></li>
<li><a class="reference internal" href="#standalone-scripts" id="id19">Standalone Scripts</a></li>
</ul>
</li>
</ul>
</li>
<li><a class="reference internal" href="#about-this-documentation" id="id20">About this documentation</a></li>
</ul>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field"><th class="field-name">Editor:</th><td class="field-body">Rene Baston, Christoph Dalitz</td>
</tr>
<tr class="field"><th class="field-name">Version:</th><td class="field-body">1.0.6</td>
</tr>
</tbody>
</table>
<p>Use the 'Addons' section on the <a class="reference external" href="http://gamera.informatik.hsnr.de/addons/">Gamera home page</a> for access to file
releases of this toolkit.</p>
<div class="section" id="overview">
<h1><a class="toc-backref" href="#id7">Overview</a></h1>
<p>The purpose of the <em>OCR Toolkit</em> is to help build optical character
recognition (OCR) systems for standard text documents. Even though it
can be used as is, it is specifically designed to make individual
steps of the recognition system customizable and replaceable.
The toolkit is based on and requires the <a class="reference external" href="http://gamera.sf.net/">Gamera framework</a> for document
analysis and recognition. As an addon package for Gamera, it provides</p>
<ul class="simple">
<li>python library functions for building a custom OCR system</li>
<li>a ready-to-run python script <tt class="docutils literal">ocr4gamera</tt> which acts as a basic OCR-system</li>
</ul>
<p>A comprehensive overview of design, usage and customization of the OCR toolkit
can be found in the paper</p>
<blockquote>
C. Dalitz, R. Baston: <a class="reference external" href="http://lionel.kr.hsnr.de/~dalitz/data/publications/sr09-ocr-gamera.pdf">Optical Character Recognition with the
Gamera Framework.</a> In C. Dalitz (Ed.): &quot;Document Image Analysis with
the Gamera Framework.&quot; Schriftenreihe des Fachbereichs Elektrotechnik und
Informatik, Hochschule Niederrhein, vol. 8, pp. 53-65, Shaker Verlag (2009)</blockquote>
<div class="section" id="the-recognition-process">
<h2><a class="toc-backref" href="#id8">The recognition process</a></h2>
<p><em>Optical character recognition</em> (OCR) means the extraction of
machine-readable text from bitmap images of text documents.
This process typically consists of the following steps:</p>
<dl class="docutils">
<dt><strong>Preprocessing:</strong></dt>
<dd>Includes binarization, skew correction, image enhancement,
text/graphics separation</dd>
<dt><strong>Segmentation:</strong></dt>
<dd>Segmentation of the page into text lines (page segmentation) and
of the text lines into characters (character segmentation)</dd>
<dt><strong>Classification:</strong></dt>
<dd>Identification of the individual characters</dd>
<dt><strong>Postprocessing:</strong></dt>
<dd>Includes the generation of the output string and possibly the
detection and correction of errors</dd>
</dl>
<p>The OCR toolkit only covers the process from segmentation to postprocessing.
For preprocessing, the standard routines shipped with Gamera must be used
beforehand, e.g. <em>rotation_angle_projections</em> for skew correction, or
<em>despeckle</em> for noise removal.</p>
<p>For classification, the kNN classifier shipped with Gamera must be used.
This means in particular that you must train the classifier on some sample
pages before doing the classification. At present, the toolkit does not
include training databases for common fonts.</p>
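<p>To illustrate why training data is required, here is a minimal sketch of
k-nearest-neighbour classification in plain Python. It does not use Gamera's
kNN classifier or its feature functions; the function name
<tt class="docutils literal">knn_classify</tt>, the two-dimensional feature
vectors and the labels are invented for this example.</p>
<pre class="literal-block">
```python
from collections import Counter


def knn_classify(training, sample, k=3):
    """Return the majority label among the k training items nearest to sample.

    training is a list of (feature_vector, label) pairs.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(training,
                     key=lambda item: squared_distance(item[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features, e.g. (aspect ratio, black pixel fraction) of a glyph.
training = [
    ((0.30, 0.90), "l"), ((0.32, 0.88), "l"), ((0.28, 0.91), "l"),
    ((0.95, 0.55), "o"), ((1.00, 0.52), "o"), ((0.97, 0.50), "o"),
]

print(knn_classify(training, (0.31, 0.87)))  # nearest neighbours are all "l"
print(knn_classify(training, (0.98, 0.53)))  # nearest neighbours are all "o"
```
</pre>
<p>Without labelled samples the classifier has nothing to compare an unknown
glyph against, which is why sample pages must be trained first.</p>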
</div>
<div class="section" id="provided-components">
<h2><a class="toc-backref" href="#id9">Provided Components</a></h2>
<p>The toolkit consists of two Python modules, one image plugin function, and
one end-user application.</p>
<p>The modules are</p>
<ul class="simple">
<li><em>classes</em>, which contains all class definitions</li>
<li><em>ocr_toolkit</em>, which contains global functions used across the classes</li>
</ul>
<p>The end user application is</p>
<ul class="simple">
<li><em>ocr4gamera.py</em>, a script that acts as a basic OCR system</li>
</ul>
<p>There is also one image plugin <em>bbox_seg</em> for textline segmentation
which is simply a wrapper around the Gamera core plugin <tt class="docutils literal">bbox_segmentation</tt>.</p>
</div>
<div class="section" id="limitations">
<h2><a class="toc-backref" href="#id10">Limitations</a></h2>
<p>As the segmentation of the individual characters is based on a connected
component analysis, the toolkit cannot deal with touching characters, unless
they have been trained as ligatures. It is therefore in general only
applicable to printed documents, rather than handwritten documents.</p>
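<p>The effect of touching characters can be seen in the following plain Python
sketch of a connected component analysis. It is not the implementation used by
Gamera; the function name <tt class="docutils literal">connected_components</tt>,
the tiny character bitmaps and the choice of 8-connectivity are assumptions made
for this illustration.</p>
<pre class="literal-block">
```python
def connected_components(bitmap):
    """Group 8-connected '#' pixels; return a list of sets of (row, col)."""
    black = set((r, c) for r, row in enumerate(bitmap)
                for c, ch in enumerate(row) if ch == "#")
    components = []
    while black:
        stack = [black.pop()]
        component = set(stack)
        while stack:
            r, c = stack.pop()
            # Visit all eight neighbours of the current pixel.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbour = (r + dr, c + dc)
                    if neighbour in black:
                        black.remove(neighbour)
                        component.add(neighbour)
                        stack.append(neighbour)
        components.append(component)
    return components


separated = ["#.#",
             "#.#",
             "#.#"]
touching = ["#.#",
            "###",
            "#.#"]

print(len(connected_components(separated)))  # two strokes, two components
print(len(connected_components(touching)))   # joined strokes, one component
```
</pre>
<p>Once two glyphs share a single black pixel bridge, they form one component
and can only be recognized if that combined shape has been trained.</p>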
<p>From a user's perspective, there are some points to be aware of in this toolkit:</p>
<ul class="simple">
<li>It does not provide methods for text/graphics separation. Hopefully,
some generic methods for this purpose will be added at some point in
the Gamera core.</li>
<li>It does not provide prototypes of Latin characters. This means that
characters must be trained on sample pages before using the toolkit.</li>
<li>The standard page segmentation algorithm for textline separation
is currently very basic.</li>
</ul>
</div>
</div>
<div class="section" id="user-s-manual">
<h1><a class="toc-backref" href="#id11">User's Manual</a></h1>
<p>This documentation is written for those who want to use the toolkit
for OCR, but are not interested in extending the toolkit itself.</p>
<ul class="simple">
<li><a class="reference external" href="usermanual.html">Using the toolkit</a>: explains how to use the toolkit.</li>
</ul>
</div>
<div class="section" id="developer-s-manual">
<h1><a class="toc-backref" href="#id12">Developer's Manual</a></h1>
<p>This documentation is for those who want to extend the functionality of
the OCR toolkit, or who want to customize specific steps of the
recognition process.</p>
<ul class="simple">
<li><a class="reference external" href="developermanual.html">Developer's manual</a>: describes how to customize the recognition process</li>
<li><a class="reference external" href="classes.html">Classes</a>: reference for the classes involved in the segmentation process.
These are:<ul>
<li><a class="reference external" href="gamera.toolkits.ocr.classes.Page.html">Page</a> for doing the page segmentation</li>
<li><a class="reference external" href="gamera.toolkits.ocr.classes.Textline.html">Textline</a> for storing the segmentation result within <tt class="docutils literal">Page</tt></li>
<li><a class="reference external" href="gamera.toolkits.ocr.classes.ClassifyCCs.html">ClassifyCCs</a> for (optionally) doing the classification during page
segmentation</li>
</ul>
</li>
<li><a class="reference external" href="functions.html">Functions</a>: the global functions defined by the toolkit</li>
<li><a class="reference external" href="plugins.html">Plugins</a>: Reference for the plugin functions shipped with this toolkit</li>
</ul>
</div>
<div class="section" id="installation">
<h1><a class="toc-backref" href="#id13">Installation</a></h1>
<p>We have only tested the toolkit on Linux and MacOS X, but as
the toolkit is written entirely in Python, the following
instructions should work for any operating system.</p>
<div class="section" id="prerequisites">
<h2><a class="toc-backref" href="#id14">Prerequisites</a></h2>
<p>First you will need a working installation of Gamera 3.x. See the
<a class="reference external" href="http://gamera.sourceforge.net/">Gamera website</a> for details. It is strongly recommended that you use
a recent version, preferably from SVN.</p>
<p>If you want to generate the documentation, you will need two additional
third-party Python libraries:</p>
<blockquote>
<ul class="simple">
<li><a class="reference external" href="http://docutils.sourceforge.net/">docutils</a> for handling reStructuredText documents.</li>
<li><a class="reference external" href="http://pygments.org/">pygments</a>  for colorizing source code.</li>
</ul>
</blockquote>
<div class="note">
<p class="first admonition-title">Note</p>
<p class="last">It is generally not necessary to generate the documentation
because it is included in file releases of the toolkit.</p>
</div>
</div>
<div class="section" id="building-and-installing">
<h2><a class="toc-backref" href="#id15">Building and Installing</a></h2>
<p>To build and install this toolkit, go to the base directory of the toolkit
distribution and run the <tt class="docutils literal">setup.py</tt> script as follows:</p>
<pre class="literal-block">
# 1) compile
python setup.py build

# 2) install
sudo python setup.py install
</pre>
<p>Command 1) compiles the toolkit from the sources and command 2) installs
it. As the latter requires
root privileges, you need to use <tt class="docutils literal">sudo</tt> on Linux and MacOS X. On Windows,
<tt class="docutils literal">sudo</tt> is not necessary.</p>
<p>Note that the script <em>ocr4gamera</em> is installed into <tt class="docutils literal">/usr/bin</tt> on Linux,
but into <tt class="docutils literal">/System/Library/Frameworks/Python.framework/Versions/2.x/bin</tt>
on MacOS X. As the latter directory is not in the standard search path,
you could either add it to your search path, or install the scripts
additionally into <tt class="docutils literal">/usr/bin</tt> on MacOS X with:</p>
<pre class="literal-block">
# install scripts into standard path (MacOS X only)
sudo python setup.py install_scripts -d /usr/bin
</pre>
<p>If you want to regenerate the documentation, go to the <tt class="docutils literal">doc</tt> directory
and run the <tt class="docutils literal">gendoc.py</tt> script. The output will be placed in the <tt class="docutils literal">doc/html/</tt>
directory.  The contents of this directory can be placed on a webserver
for convenient viewing.</p>
<div class="note">
<p class="first admonition-title">Note</p>
<p class="last">Before building the documentation you must install the
toolkit. Otherwise <tt class="docutils literal">gendoc.py</tt> will not find the plugin documentation.</p>
</div>
</div>
<div class="section" id="installing-without-root-privileges">
<h2><a class="toc-backref" href="#id16">Installing without root privileges</a></h2>
<p>The above installation with <tt class="docutils literal">python setup.py install</tt> will install
the toolkit system-wide and thus requires root privileges. If you do
not have root access (Linux) or are not a sudoer (MacOS X), you can
install the OCR toolkit into your home directory. Note however
that this also requires that Gamera is installed into your home directory.
It is currently not possible to install Gamera globally and only toolkits
locally.</p>
<p>Here are the steps to install both Gamera and the OCR toolkit into
<tt class="docutils literal">~/python</tt>:</p>
<pre class="literal-block">
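# install Gamera locally (run in the Gamera source directory)
mkdir ~/python
python setup.py install --prefix=~/python

# build and install the OCR toolkit locally (run in the toolkit source directory)
# note: $HOME is used here because the shell does not expand "~" after "-I"
export CFLAGS="-I$HOME/python/include/python2.3/gamera"
python setup.py build
python setup.py install --prefix=~/python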
</pre>
<p>Moreover you should set the following environment variables in your
<tt class="docutils literal"><span class="pre">~/.profile</span></tt>:</p>
<pre class="literal-block">
# search path for python modules
export PYTHONPATH=~/python/lib/python

# search path for executables (e.g. gamera_gui)
export PATH=~/python/bin:$PATH
</pre>
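<p>After setting these variables, a quick check along the following lines can
confirm that the Python interpreter finds both packages. This is only a sketch:
the helper name <tt class="docutils literal">is_importable</tt> is made up here,
and the module name <tt class="docutils literal">gamera.toolkits.ocr</tt> is
inferred from the installation directory of the toolkit.</p>
<pre class="literal-block">
```python
import importlib


def is_importable(module_name):
    """Return True if module_name can be imported by this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Report whether Gamera and the OCR toolkit are on the module search path.
for name in ("gamera", "gamera.toolkits.ocr"):
    status = "found" if is_importable(name) else "NOT found"
    print("%-20s %s" % (name, status))
```
</pre>
<p>If one of the packages is reported as not found, check that
<tt class="docutils literal">PYTHONPATH</tt> points to the directory that
actually contains the <tt class="docutils literal">gamera</tt> folder.</p>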
</div>
<div class="section" id="uninstallation">
<h2><a class="toc-backref" href="#id17">Uninstallation</a></h2>
<p>The installation uses the Python <em>distutils</em>, which do not support
uninstallation. Thus you need to remove the installed files manually:</p>
<ul class="simple">
<li>the installed Python library files of the toolkit</li>
<li>the installed standalone scripts</li>
</ul>
<div class="section" id="python-library-files">
<h3><a class="toc-backref" href="#id18">Python Library Files</a></h3>
<p>All python library files of this toolkit are installed into the
<tt class="docutils literal">gamera/toolkits/ocr</tt> subdirectory of the Python library folder.
Thus it is sufficient to remove this directory for an uninstallation.</p>
<p>The location of the Python library folder depends on your system and Python version.
Here are the folders that you need to remove on MacOS X and Debian Linux
(&quot;2.3&quot; stands for the Python version; replace it with your actual version):</p>
<blockquote>
<ul class="simple">
<li>MacOS X: <tt class="docutils literal">/Library/Python/2.3/gamera/toolkits/ocr</tt></li>
<li>Debian Linux: <tt class="docutils literal"><span class="pre">/usr/lib/python2.3/site-packages/gamera/toolkits/ocr</span></tt></li>
</ul>
</blockquote>
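<p>If you are unsure where the library folder is on your system, the following
sketch prints candidate locations based on the default site-packages
directories of the running interpreter. The helper name
<tt class="docutils literal">toolkit_dir_candidates</tt> is made up for this
example, and the script assumes an interpreter that provides the standard
<tt class="docutils literal">sysconfig</tt> module. It deletes nothing, and it
does not know about installations made with <tt class="docutils literal"><span class="pre">--prefix</span></tt>
or <tt class="docutils literal"><span class="pre">--home</span></tt>.</p>
<pre class="literal-block">
```python
import os
import sysconfig


def toolkit_dir_candidates():
    """Return possible install directories of the OCR toolkit, no duplicates."""
    paths = sysconfig.get_paths()
    candidates = []
    # Pure Python and platform specific packages may live in different folders.
    for key in ("purelib", "platlib"):
        directory = os.path.join(paths[key], "gamera", "toolkits", "ocr")
        if directory not in candidates:
            candidates.append(directory)
    return candidates


for directory in toolkit_dir_candidates():
    state = "exists" if os.path.isdir(directory) else "not present"
    print("%s (%s)" % (directory, state))
```
</pre>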
</div>
<div class="section" id="standalone-scripts">
<h3><a class="toc-backref" href="#id19">Standalone Scripts</a></h3>
<p>The standalone scripts are installed into <tt class="docutils literal">/usr/bin</tt> (Linux) or
<tt class="docutils literal">/System/Library/Frameworks/Python.framework/Versions/2.3/bin</tt> (MacOS X),
unless you have explicitly chosen a different location with the options
<tt class="docutils literal"><span class="pre">--prefix</span></tt> or <tt class="docutils literal"><span class="pre">--home</span></tt> during installation.</p>
<p>For an uninstall, remove the following script:</p>
<blockquote>
<ul class="simple">
<li><tt class="docutils literal">ocr4gamera.py</tt></li>
</ul>
</blockquote>
<div class="note">
<p class="first admonition-title">Note</p>
<p class="last">In older versions (1.0.0 and 1.0.1) this script was named
<tt class="docutils literal">ocr4gamera</tt>. Remove this old script if you are upgrading from
one of these versions.</p>
</div>
</div>
</div>
</div>
<div class="section" id="about-this-documentation">
<h1><a class="toc-backref" href="#id20">About this documentation</a></h1>
<p>The documentation was written by Rene Baston and Christoph Dalitz.
Permission is granted to copy, distribute and/or modify this documentation
under the terms of the <a class="reference external" href="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution Share-Alike License (CC-BY-SA) v3.0</a>. In addition, permission is granted to use and/or
modify the code snippets from the documentation without restrictions.</p>
</div>
</div>
</body>
</html>