<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.8.1: http://docutils.sourceforge.net/" />
<title>OCR Toolkit: Global Functions</title>
<link rel="stylesheet" href="default.css" type="text/css" />
</head>
<body>
<div class="document" id="ocr-toolkit-global-functions">
<h1 class="title">OCR Toolkit: Global Functions</h1>
<p><strong>Last modified</strong>: February 13, 2012</p>
<div class="contents topic" id="contents">
<p class="topic-title first">Contents</p>
<ul class="simple">
<li><a class="reference internal" href="#output-text-generation" id="id5">Output text generation</a><ul>
<li><a class="reference internal" href="#id1" id="id6"><tt class="docutils literal">textline_to_string</tt></a></li>
<li><a class="reference internal" href="#id2" id="id7"><tt class="docutils literal">return_char</tt></a></li>
<li><a class="reference internal" href="#chars-make-words" id="id8"><tt class="docutils literal">chars_make_words</tt></a></li>
</ul>
</li>
<li><a class="reference internal" href="#segmentation" id="id9">Segmentation</a><ul>
<li><a class="reference internal" href="#get-line-glyphs" id="id10"><tt class="docutils literal">get_line_glyphs</tt></a></li>
<li><a class="reference internal" href="#show-bboxes" id="id11"><tt class="docutils literal">show_bboxes</tt></a></li>
</ul>
</li>
</ul>
</div>
<p>The toolkit defines a number of free functions that are not image
methods. They are defined in <em>ocr_toolkit.py</em> and can be imported
into a Python script with</p>
<div class="highlight"><pre><span class="kn">from</span> <span class="nn">gamera.toolkits.ocr.ocr_toolkit</span> <span class="kn">import</span> <span class="o">*</span>
</pre></div>
<div class="section" id="output-text-generation">
<h1><a class="toc-backref" href="#id5">Output text generation</a></h1>
<p>While the class <a class="reference external" href="gamera.toolkits.ocr.classes.Page.html">Page</a> splits the image into <a class="reference external" href="gamera.toolkits.ocr.classes.Textline.html">Textline</a> objects and
possibly classifies the characters, it does not generate an output string.
For this purpose, you can use the function <a class="reference external" href="#id1">textline_to_string</a>.</p>
<div class="section" id="id1">
<h2><a class="toc-backref" href="#id6"><tt class="docutils literal">textline_to_string</tt></a></h2>
<p>Returns a Unicode string of the text in the given <tt class="docutils literal">Textline</tt>.</p>
<p>Signature:</p>
<blockquote>
<tt class="docutils literal">textline_to_string (textline, <span class="pre">heuristic_rules="roman",</span> <span class="pre">extra_chars_dict={})</span></tt></blockquote>
<p>with</p>
<blockquote>
<dl class="docutils">
<dt><em>textline</em>:</dt>
<dd>A <tt class="docutils literal">Textline</tt> object containing the glyphs. The glyphs must already
be classified.</dd>
<dt><em>heuristic_rules</em>:</dt>
<dd><p class="first">Depending on the alphabet, some characters can look very similar and
need further heuristic rules for disambiguation, such as apostrophe and
comma, which have the same shape and differ only in their position
relative to the baseline.</p>
<p class="last">When set to "roman", several rules specific to Latin alphabets
are applied.</p>
</dd>
<dt><em>extra_chars_dict</em>:</dt>
<dd>A dictionary of additional translations of class names to character codes.
This is necessary when you use class names that are not Unicode names.
It is passed on to <a class="reference external" href="#id2">return_char</a>.</dd>
</dl>
</blockquote>
<p>As this function uses <a class="reference external" href="#id2">return_char</a>, the class names of the glyphs in
<em>textline</em> must correspond to Unicode character names unless they are
listed in <em>extra_chars_dict</em>, as described in
the documentation of <a class="reference external" href="#id2">return_char</a>.</p>
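<p>The kind of positional rule mentioned under <em>heuristic_rules</em> can be
illustrated with a minimal sketch. It is plain Python and does not use Gamera;
the function name <tt class="docutils literal">comma_or_apostrophe</tt>, its parameters, and the
half-height margin are assumptions made for illustration, not the rules the
toolkit actually applies.</p>
<div class="highlight"><pre>
```python
def comma_or_apostrophe(top, bottom, baseline_y):
    """Pick a class name for a comma-shaped glyph from its vertical position.

    top, bottom: y coordinates of the glyph's bounding box (y grows downwards).
    baseline_y:  y coordinate of the baseline of the text line.
    The half-height margin below is an arbitrary, illustrative choice.
    """
    height = bottom - top + 1
    # An apostrophe hangs well above the baseline,
    # while a comma touches or crosses it.
    if baseline_y - bottom > 0.5 * height:
        return "apostrophe"
    return "comma"
```
</pre></div>
<p>Both return values are valid class names in the sense of
<tt class="docutils literal">return_char</tt>, because <tt class="docutils literal">APOSTROPHE</tt> and
<tt class="docutils literal">COMMA</tt> are Unicode character names.</p>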
</div>
<div class="section" id="id2">
<h2><a class="toc-backref" href="#id7"><tt class="docutils literal">return_char</tt></a></h2>
<p>Converts a class name derived from a Unicode character name to the corresponding character.</p>
<p>Signature:</p>
<blockquote>
<tt class="docutils literal">return_char (classname, <span class="pre">extra_chars_dict={})</span></tt></blockquote>
<p>with</p>
<blockquote>
<dl class="docutils">
<dt><em>classname</em>:</dt>
<dd>A class name derived from a Unicode character name.
Example: <tt class="docutils literal">latin.small.letter.a</tt> returns the character <tt class="docutils literal">a</tt>.</dd>
<dt><em>extra_chars_dict</em>:</dt>
<dd>A dictionary of additional translations of class names to character codes.
This is necessary when you use class names that are not Unicode names.
The character 'code' does not need to be an actual code, but can be
any string. This can be useful, e.g. for ligatures:</dd>
</dl>
<div class="highlight"><pre><span class="n">return_char</span><span class="p">(</span><span class="n">glyph</span><span class="o">.</span><span class="n">get_main_id</span><span class="p">(),</span> <span class="p">{</span><span class="s">'latin.small.ligature.st'</span><span class="p">:</span><span class="s">'st'</span><span class="p">})</span>
</pre></div>
</blockquote>
<p>When <em>classname</em> is not listed in <em>extra_chars_dict</em>, it must correspond
to a <a class="reference external" href="http://www.unicode.org/charts/">standard Unicode character name</a>,
as in the examples of the following table, where each class name is the
Unicode name in lower case with spaces replaced by dots:</p>
<table border="1" class="docutils">
<colgroup>
<col width="16%" />
<col width="42%" />
<col width="42%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Character</th>
<th class="head">Unicode Name</th>
<th class="head">Class Name</th>
</tr>
</thead>
<tbody valign="top">
<tr><td><tt class="docutils literal">!</tt></td>
<td><tt class="docutils literal">EXCLAMATION MARK</tt></td>
<td><tt class="docutils literal">exclamation.mark</tt></td>
</tr>
<tr><td><tt class="docutils literal">2</tt></td>
<td><tt class="docutils literal">DIGIT TWO</tt></td>
<td><tt class="docutils literal">digit.two</tt></td>
</tr>
<tr><td><tt class="docutils literal">A</tt></td>
<td><tt class="docutils literal">LATIN CAPITAL LETTER A</tt></td>
<td><tt class="docutils literal">latin.capital.letter.a</tt></td>
</tr>
<tr><td><tt class="docutils literal">a</tt></td>
<td><tt class="docutils literal">LATIN SMALL LETTER A</tt></td>
<td><tt class="docutils literal">latin.small.letter.a</tt></td>
</tr>
</tbody>
</table>
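<p>The mapping shown in the table can be reproduced with the standard library
module <tt class="docutils literal">unicodedata</tt>. The following is a minimal sketch in plain
Python 3 that does not import Gamera; the helper name
<tt class="docutils literal">classname_to_char</tt> is hypothetical, and the sketch is not the
toolkit's own implementation of <tt class="docutils literal">return_char</tt>.</p>
<div class="highlight"><pre>
```python
import unicodedata


def classname_to_char(classname, extra_chars_dict=None):
    """Map a class name such as 'latin.small.letter.a' to a character."""
    extra = extra_chars_dict or {}
    # User-supplied translations are consulted before the Unicode lookup.
    if classname in extra:
        return extra[classname]
    # 'latin.small.letter.a' becomes 'LATIN SMALL LETTER A'.
    unicode_name = classname.replace(".", " ").upper()
    # unicodedata.lookup raises KeyError if no character has that name.
    return unicodedata.lookup(unicode_name)


print(classname_to_char("exclamation.mark"))
print(classname_to_char("latin.small.ligature.st",
                        {"latin.small.ligature.st": "st"}))
```
</pre></div>
<p>A class name that is neither in the dictionary nor a valid Unicode name raises
<tt class="docutils literal">KeyError</tt> in this sketch; how the toolkit reacts in that case is not
covered here.</p>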
</div>
<div class="section" id="chars-make-words">
<h2><a class="toc-backref" href="#id8"><tt class="docutils literal">chars_make_words</tt></a></h2>
<p>Groups the given glyphs into words based on the horizontal distance
between adjacent glyphs.</p>
<dl class="docutils">
<dt>Signature:</dt>
<dd><tt class="docutils literal">chars_make_words (glyphs, threshold=None)</tt></dd>
</dl>
<p>with</p>
<blockquote>
<dl class="docutils">
<dt><em>glyphs</em>:</dt>
<dd>A list of <tt class="docutils literal">Cc</tt> data types, each of which represents a character.
All glyphs must stem from the same single line of text.</dd>
<dt><em>threshold</em>:</dt>
<dd>Horizontal white space greater than <em>threshold</em> will be considered
a word-separating gap. When <tt class="docutils literal">None</tt>, the threshold value is
calculated automatically as 2.5 times the median white space
between adjacent glyphs.</dd>
</dl>
</blockquote>
<p>The result is a nested list of glyphs with each sublist representing
a word. This is the same data structure as used in <a class="reference external" href="gamera.toolkits.ocr.classes.Textline.html">Textline.words</a>.</p>
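<p>The grouping rule can be sketched without Gamera by representing each glyph
only by the horizontal extent of its bounding box. The function name
<tt class="docutils literal">group_into_words</tt> and the <tt class="docutils literal">(left, right)</tt> tuples are
assumptions made for illustration; only the default threshold of 2.5 times the
median gap is taken from the description above.</p>
<div class="highlight"><pre>
```python
from statistics import median


def group_into_words(boxes, threshold=None):
    """Group (left, right) x-extents of glyphs into words.

    boxes: list of (left, right) tuples, all from one text line.
    Returns a nested list with one sublist per word.
    """
    if not boxes:
        return []
    boxes = sorted(boxes)  # left-to-right reading order
    # White space between each glyph and its right neighbour.
    gaps = [max(0, nxt[0] - cur[1]) for cur, nxt in zip(boxes, boxes[1:])]
    if threshold is None:
        # Default described above: 2.5 times the median gap.
        threshold = 2.5 * median(gaps) if gaps else 0
    words = [[boxes[0]]]
    for gap, box in zip(gaps, boxes[1:]):
        if gap > threshold:
            words.append([box])    # large gap: start a new word
        else:
            words[-1].append(box)  # small gap: same word
    return words
```
</pre></div>
<p>Using the median makes the default robust against the few large gaps between
words, since most gaps in a line are the small ones between letters.</p>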
</div>
</div>
<div class="section" id="segmentation">
<h1><a class="toc-backref" href="#id9">Segmentation</a></h1>
<p>These functions are used in the segmentation methods of class <a class="reference external" href="gamera.toolkits.ocr.classes.Page.html">Page</a>.
You will generally not need to call them, unless you are implementing
a custom segmentation method.</p>
<div class="section" id="get-line-glyphs">
<h2><a class="toc-backref" href="#id10"><tt class="docutils literal">get_line_glyphs</tt></a></h2>
<p>Splits image regions representing text lines into characters.</p>
<p>Signature:</p>
<blockquote>
<tt class="docutils literal">get_line_glyphs (image, segments)</tt></blockquote>
<p>with</p>
<blockquote>
<dl class="docutils">
<dt><em>image</em>:</dt>
<dd>The document image that is to be further segmented. It must contain the
same underlying image data as the second argument <em>segments</em>.</dd>
<dt><em>segments</em>:</dt>
<dd>A list of <tt class="docutils literal">Cc</tt> data types, each of which represents a text line region.
The image views must correspond to <em>image</em>, i.e. each pixel has a value
that is the unique label of the text line it belongs to. This is the
interface used by the plugins in the "PageSegmentation" section of the
Gamera core.</dd>
</dl>
</blockquote>
<p>The result is returned as a list of <a class="reference external" href="gamera.toolkits.ocr.classes.Textline.html">Textline</a> objects.</p>
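<p>The labelled-pixel interface described for <em>segments</em> can be pictured
with a small stand-in that uses nested lists instead of Gamera images. The
helper <tt class="docutils literal">label_bounding_boxes</tt> is hypothetical, and treating the
label 0 as background is an assumption of this sketch.</p>
<div class="highlight"><pre>
```python
def label_bounding_boxes(labels):
    """Return {label: (min_x, min_y, max_x, max_y)} for every nonzero label.

    labels: 2D list in which each entry is the label of the text line the
    pixel belongs to (0 is assumed to mean background).
    """
    boxes = {}
    for y, row in enumerate(labels):
        for x, label in enumerate(row):
            if label == 0:
                continue
            if label in boxes:
                x0, y0, x1, y1 = boxes[label]
                boxes[label] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
            else:
                boxes[label] = (x, y, x, y)
    return boxes


page = [
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [2, 2, 2, 0, 0],
]
print(label_bounding_boxes(page))
```
</pre></div>
<p>Each bounding box here plays the role of one text line region; splitting such
a region into characters is the part that <tt class="docutils literal">get_line_glyphs</tt>
performs and is not shown.</p>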
</div>
<div class="section" id="show-bboxes">
<h2><a class="toc-backref" href="#id11"><tt class="docutils literal">show_bboxes</tt></a></h2>
<p>Returns an RGB image with the bounding boxes of the given glyphs drawn as
hollow rectangles. Useful for visualizing and debugging a segmentation.</p>
<p>Signature:</p>
<blockquote>
<tt class="docutils literal">show_bboxes (image, glyphs)</tt></blockquote>
<p>with</p>
<blockquote>
<dl class="docutils">
<dt><em>image</em>:</dt>
<dd>An image of the text document that is to be segmented.</dd>
<dt><em>glyphs</em>:</dt>
<dd>A list of rectangles that will be drawn on <tt class="docutils literal">image</tt> as hollow rectangles.
As all image types are derived from <tt class="docutils literal">Rect</tt>, any list of images can
be passed.</dd>
</dl>
</blockquote>
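<p>What drawing hollow rectangles means can be shown with a minimal sketch on a
nested list of RGB tuples. It does not use Gamera; the function name
<tt class="docutils literal">draw_hollow_rects</tt>, the white background, and the red outline
colour are assumptions made for illustration.</p>
<div class="highlight"><pre>
```python
WHITE = (255, 255, 255)
RED = (255, 0, 0)


def draw_hollow_rects(width, height, rects, color=RED):
    """Return a height x width grid of RGB tuples with rectangle outlines.

    rects: list of (x0, y0, x1, y1) boxes with inclusive corner coordinates.
    """
    canvas = [[WHITE] * width for _ in range(height)]
    for x0, y0, x1, y1 in rects:
        for x in range(x0, x1 + 1):
            canvas[y0][x] = color  # top edge
            canvas[y1][x] = color  # bottom edge
        for y in range(y0, y1 + 1):
            canvas[y][x0] = color  # left edge
            canvas[y][x1] = color  # right edge
    return canvas
```
</pre></div>
<p>Only the outline is coloured, so the glyph pixels inside each box would stay
visible when the rectangles are drawn on top of the document image.</p>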
</div>
</div>
</div>
</body>
</html>