This file is indexed.

/usr/share/doc/python-gamera.toolkits.ocr/html/functions.html is in python-gamera.toolkits.ocr 1.0.6-3.

This file is owned by root:root, with mode 0o644.

The actual contents of the file can be viewed below.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.8.1: http://docutils.sourceforge.net/" />
<title>OCR Toolkit: Global Functions</title>
<link rel="stylesheet" href="default.css" type="text/css" />
</head>
<body>
<div class="document" id="ocr-toolkit-global-functions">
<h1 class="title">OCR Toolkit: Global Functions</h1>

<p><strong>Last modified</strong>: February 13, 2012</p>
<div class="contents topic" id="contents">
<p class="topic-title first">Contents</p>
<ul class="simple">
<li><a class="reference internal" href="#output-text-generation" id="id5">Output text generation</a><ul>
<li><a class="reference internal" href="#id1" id="id6"><tt class="docutils literal">textline_to_string</tt></a></li>
<li><a class="reference internal" href="#id2" id="id7"><tt class="docutils literal">return_char</tt></a></li>
<li><a class="reference internal" href="#chars-make-words" id="id8"><tt class="docutils literal">chars_make_words</tt></a></li>
</ul>
</li>
<li><a class="reference internal" href="#segmentation" id="id9">Segmentation</a><ul>
<li><a class="reference internal" href="#get-line-glyphs" id="id10"><tt class="docutils literal">get_line_glyphs</tt></a></li>
<li><a class="reference internal" href="#show-bboxes" id="id11"><tt class="docutils literal">show_bboxes</tt></a></li>
</ul>
</li>
</ul>
</div>
<p>The toolkit defines a number of free function which are not image
methods. These are defined in <em>ocr_toolkit.py</em> and can be imported
in a python script with</p>
<div class="highlight"><pre><span class="kn">from</span> <span class="nn">gamera.toolkits.ocr.ocr_toolkit</span> <span class="kn">import</span> <span class="o">*</span>
</pre></div>
<div class="section" id="output-text-generation">
<h1><a class="toc-backref" href="#id5">Output text generation</a></h1>
<p>While the class <a class="reference external" href="gamera.toolkits.ocr.classes.Page.html">Page</a> splits the image into <a class="reference external" href="gamera.toolkits.ocr.classes.Textline.html">Textline</a> objects and
possibly classifies the characters, it does not generate an output string.
For this purpose, you can use the function <a class="reference external" href="#textline-to-string">textline_to_string</a>.</p>
<div class="section" id="id1">
<h2><a class="toc-backref" href="#id6"><tt class="docutils literal">textline_to_string</tt></a></h2>
<p>Returns a unicode string of the text in the given <tt class="docutils literal">Textline</tt>.</p>
<p>Signature:</p>
<blockquote>
<tt class="docutils literal">textline_to_string (textline, <span class="pre">heuristic_rules=&quot;roman&quot;,</span> <span class="pre">extra_chars_dict={})</span></tt></blockquote>
<p>with</p>
<blockquote>
<dl class="docutils">
<dt><em>textline</em>:</dt>
<dd>A <tt class="docutils literal">Textline</tt> object containing the glyphs. The glyphs must already
be classified.</dd>
<dt><em>heuristic_rules</em>:</dt>
<dd><p class="first">Depending on the alphabeth, some characters can very similar and
need further heuristic rules for disambiguation, like apostroph and
comma, which have the same shape and only differ in their position
relative to the baseline.</p>
<p class="last">When set to &quot;roman&quot;, several rules specific for latin alphabeths
are applied.</p>
</dd>
<dt><em>extra_chars_dict</em></dt>
<dd>A dictionary of additional translations of classnames to character codes.
This is necessary when you use class names that are not unicode names.
Will be passed to <a class="reference external" href="#return-char">return_char</a>.</dd>
</dl>
</blockquote>
<p>As this function uses <a class="reference external" href="#return-char">return_char</a>, the class names of the glyphs in
<em>textline</em> must corerspond to unicode character names, as described in
the documentation of <a class="reference external" href="#return-char">return_char</a>.</p>
</div>
<div class="section" id="id2">
<h2><a class="toc-backref" href="#id7"><tt class="docutils literal">return_char</tt></a></h2>
<p>Converts a unicode character name to a unicode symbol.</p>
<p>Signature:</p>
<blockquote>
<tt class="docutils literal">return_char (classname, <span class="pre">extra_chars_dict={})</span></tt></blockquote>
<p>with</p>
<blockquote>
<dl class="docutils">
<dt><em>classname</em>:</dt>
<dd>A class name derived from a unicode character name.
Example: <tt class="docutils literal">latin.small.letter.a</tt> returns the character <tt class="docutils literal">a</tt>.</dd>
<dt><em>extra_chars_dict</em></dt>
<dd>A dictionary of additional translations of classnames to character codes.
This is necessary when you use class names that are not unicode names.
The character 'code' does not need to be an actual code, but can be
any string. This can be useful, e.g. for ligatures:</dd>
</dl>
<div class="highlight"><pre><span class="n">return_char</span><span class="p">(</span><span class="n">glyph</span><span class="o">.</span><span class="n">get_main_id</span><span class="p">(),</span> <span class="p">{</span><span class="s">&#39;latin.small.ligature.st&#39;</span><span class="p">:</span><span class="s">&#39;st&#39;</span><span class="p">})</span>
</pre></div>
</blockquote>
<p>When <em>classname</em> is not listed in <em>extra_chars_dict</em>, it must correspond
to a <a class="reference external" href="http://www.unicode.org/charts/">standard unicode character name</a>,
as in the examples of the following table:</p>
<table border="1" class="docutils">
<colgroup>
<col width="16%" />
<col width="42%" />
<col width="42%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Character</th>
<th class="head">Unicode Name</th>
<th class="head">Class Name</th>
</tr>
</thead>
<tbody valign="top">
<tr><td><tt class="docutils literal">!</tt></td>
<td><tt class="docutils literal">EXCLAMATION MARK</tt></td>
<td><tt class="docutils literal">exclamation.mark</tt></td>
</tr>
<tr><td><tt class="docutils literal">2</tt></td>
<td><tt class="docutils literal">DIGIT TWO</tt></td>
<td><tt class="docutils literal">digit.two</tt></td>
</tr>
<tr><td><tt class="docutils literal">A</tt></td>
<td><tt class="docutils literal">LATIN CAPITAL LETTER A</tt></td>
<td><tt class="docutils literal">latin.capital.letter.a</tt></td>
</tr>
<tr><td><tt class="docutils literal">a</tt></td>
<td><tt class="docutils literal">LATIN SMALL LETTER A</tt></td>
<td><tt class="docutils literal">latin.small.letter.a</tt></td>
</tr>
</tbody>
</table>
</div>
<div class="section" id="chars-make-words">
<h2><a class="toc-backref" href="#id8"><tt class="docutils literal">chars_make_words</tt></a></h2>
<p>Groups the given glyphs to words based upon the horizontal distance
between adjacent glyphs.</p>
<dl class="docutils">
<dt>Signature:</dt>
<dd><tt class="docutils literal">chars_make_words (glyphs, threshold=None)</tt></dd>
</dl>
<p>with</p>
<blockquote>
<dl class="docutils">
<dt><em>glyphs</em>:</dt>
<dd>A list of <tt class="docutils literal">Cc</tt> data types, each of which representing a character.
All glyphs must stem from the same single line of text.</dd>
<dt><em>threshold</em>:</dt>
<dd>Horizontal white space greater than <em>threshold</em> will be considered
a word separating gap. When <tt class="docutils literal">None</tt>, the threshold value is
calculated automatically as 2.5 times teh median white space
between adjacent glyphs.</dd>
</dl>
</blockquote>
<p>The result is a nested list of glyphs with each sublist representing
a word. This is the same data structure as used in <a class="reference external" href="gamera.toolkits.ocr.classes.Textline.html">Textline.words</a></p>
</div>
</div>
<div class="section" id="segmentation">
<h1><a class="toc-backref" href="#id9">Segmentation</a></h1>
<p>These functions are used in the segmentation methods of class <a class="reference external" href="gamera.toolkits.ocr.classes.Page.html">Page</a>.
You will generally not need to call them, unless you are implementing
a custom segmentation method.</p>
<div class="section" id="get-line-glyphs">
<h2><a class="toc-backref" href="#id10"><tt class="docutils literal">get_line_glyphs</tt></a></h2>
<p>Splits image regions representing text lines into characters.</p>
<p>Signature:</p>
<blockquote>
<tt class="docutils literal">get_line_glyphs (image, segments)</tt></blockquote>
<p>with</p>
<blockquote>
<dl class="docutils">
<dt><em>image</em>:</dt>
<dd>The document image that is to be further segmentated. It must contin the
same underlying image data as the second argument <em>segments</em></dd>
<dt><em>segments</em>:</dt>
<dd>A list <tt class="docutils literal">Cc</tt> data types, each of which represents a text line region.
The image views must correspond to <em>image</em>, i.e. each pixels has a value
that is the unique label of the text line it belongs to. This is the
interface used by the plugins in the &quot;PageSegmentation&quot; section of the
Gamera core.</dd>
</dl>
</blockquote>
<p>The result is returned as a list of <a class="reference external" href="gamera.toolkits.ocr.classes.Textline.html">Textline</a> objects.</p>
</div>
<div class="section" id="show-bboxes">
<h2><a class="toc-backref" href="#id11"><tt class="docutils literal">show_bboxes</tt></a></h2>
<p>Returns an RGB image with bounding boxes of the given glyphs as
hollow rects. Useful for visualization and debugging of a segmentation.</p>
<p>Signature:</p>
<blockquote>
<tt class="docutils literal">show_bboxes (image, glyphs)</tt></blockquote>
<p>with:</p>
<blockquote>
<dl class="docutils">
<dt><em>image</em>:</dt>
<dd>An image of the textdokument which has to be segmentated.</dd>
<dt><em>glyphs</em>:</dt>
<dd>List of rects which will be drawn on <tt class="docutils literal">image</tt> as hollow rects.
As all image types are derived from <tt class="docutils literal">Rect</tt>, any image list can
be passed.</dd>
</dl>
</blockquote>
</div>
</div>
</div>
</body>
</html>