/usr/share/doc/idzebra-2.0-doc/idzebra-2.0/character-map-files.html is in idzebra-2.0-doc 2.0.59-1.
<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>2. Charmap Files</title><meta name="generator" content="DocBook XSL Stylesheets V1.78.1"><link rel="home" href="index.html" title="Zebra - User's Guide and Reference"><link rel="up" href="fields-and-charsets.html" title="Chapter 10. Field Structure and Character Sets"><link rel="prev" href="fields-and-charsets.html" title="Chapter 10. Field Structure and Character Sets"><link rel="next" href="icuchain-files.html" title="3. ICU Chain Files"></head><body><link rel="stylesheet" type="text/css" href="common/style1.css"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">2. Charmap Files</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="fields-and-charsets.html">Prev</a> </td><th width="60%" align="center">Chapter 10. Field Structure and Character Sets
</th><td width="20%" align="right"> <a accesskey="n" href="icuchain-files.html">Next</a></td></tr></table><hr></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="character-map-files"></a>2. Charmap Files</h2></div></div></div><p>
The character map files define the word tokenization
and character normalization performed before text is inserted into
the inverted indexes. <span class="application">Zebra</span> ships with the predefined character map
files <code class="filename">tab/*.chr</code>. Users may add
or modify maps according to their needs.
</p><div class="table"><a name="character-map-table"></a><p class="title"><b>Table 10.1. Character maps predefined in <span class="application">Zebra</span></b></p><div class="table-contents"><table summary="Character maps predefined in Zebra" border="1"><colgroup><col><col><col></colgroup><thead><tr><th>File name</th><th>Intended type</th><th>Description</th></tr></thead><tbody><tr><td><code class="literal">numeric.chr</code></td><td><code class="literal">:n</code></td><td>Numeric digit tokenization and normalization map. All
characters not in the set <code class="literal">-{0-9}.,</code> will be
suppressed. Note that floating point numbers are handled
correctly, but numbers in scientific (exponential) notation are mangled, because the exponent letter is suppressed.</td></tr><tr><td><code class="literal">scan.chr</code></td><td><code class="literal">:w or :p</code></td><td>Word tokenization character map for Scandinavian
languages. It resembles the generic word tokenization
character map <code class="literal">tab/string.chr</code>; the main
differences are the sort order of the special characters
<code class="literal">üzæäøöå</code> and the equivalence maps that follow
Scandinavian language rules.</td></tr><tr><td><code class="literal">string.chr</code></td><td><code class="literal">:w or :p</code></td><td>General word tokenization and normalization character
map, mostly useful for English texts. Use it as the starting point for your
own language-specific tokenization and normalization maps.</td></tr><tr><td><code class="literal">urx.chr</code></td><td><code class="literal">:u</code></td><td>URL parsing and tokenization character map.</td></tr><tr><td><code class="literal">@</code></td><td><code class="literal">:0</code></td><td>Do-nothing character map used for literal binary
indexing. No file is associated with it, and
no normalization or tokenization is performed at all.</td></tr></tbody></table></div></div><br class="table-break"><p>
The contents of the character map files are structured as follows:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">encoding <em class="replaceable"><code>encoding-name</code></em></span></dt><dd><p>
This directive must be at the very beginning of the file, and it
specifies the character encoding used in the entire file. If
omitted, the encoding <code class="literal">ISO-8859-1</code> is assumed.
</p><p>
For example, one of the test files found at
<code class="literal">test/rusmarc/tab/string.chr</code> contains the following
encoding directive:
</p><pre class="screen">
encoding koi8-r
</pre><p>
and the test file
<code class="literal">test/charmap/string.utf8.chr</code> is encoded
in UTF-8:
</p><pre class="screen">
encoding utf-8
</pre><p>
</p></dd><dt><span class="term">lowercase <em class="replaceable"><code>value-set</code></em></span></dt><dd><p>
This directive introduces the basic value set of the field type.
The format is an ordered list (without spaces) of the
characters which may occur in "words" of the given type.
The order of the entries in the list determines the
sort order of the index. In addition to single characters, the
following combinations are legal:
</p><p>
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Backslashes may be used to introduce three-digit octal or
two-digit hex representations of single characters
(the hex form is preceded by <code class="literal">x</code>).
In addition, the combinations
\\, \r, \n, \t, and \s (space; remember that real
space characters may not occur in the value definition)
are recognized, with their usual interpretation.
</p></li><li class="listitem"><p>
Curly braces {} may be used to enclose ranges of single
characters (possibly using the escape convention described in the
preceding point), e.g., {a-z} to introduce the
standard range of lowercase ASCII letters.
Note that the interpretation of such a range depends on
the concrete representation in your local, physical character set.
</p></li><li class="listitem"><p>
Parentheses () may be used to enclose multi-byte characters,
such as diacritics or special national combinations (e.g., Spanish
"ll"). When found in the input stream (or a search term),
these characters are viewed and sorted as a single character, with a
sorting value depending on the position of the group in the value
statement.
</p></li></ul></div><p>
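As a hedged illustration of the notations above, the following sketch spells the same range of lowercase letters in three ways and then adds a parenthesized group. It assumes an ASCII-compatible encoding. The lines are alternatives for comparison, not the contents of any file shipped with <span class="application">Zebra</span>, and they have not been tested against it.
</p><pre class="screen">
# plain range
lowercase {0-9}{a-z}
# the same letters as three-digit octal escapes (a = \141, z = \172)
lowercase {0-9}{\141-\172}
# the same letters as two-digit hex escapes (a = \x61, z = \x7a)
lowercase {0-9}{\x61-\x7a}
# hypothetical: "ll" sorts as one character after "l"
lowercase {0-9}{a-l}(ll){m-z}
</pre><p>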
</p><p>
For example, <code class="literal">scan.chr</code> contains the following
lowercase normalization and sorting order:
</p><pre class="screen">
lowercase {0-9}{a-y}üzæäøöå
</pre><p>
</p></dd><dt><span class="term">uppercase <em class="replaceable"><code>value-set</code></em></span></dt><dd><p>
This directive introduces the
upper-case equivalences to the value set (if any). The number and
order of the entries in the list should be the same as in the
<code class="literal">lowercase</code> directive.
</p><p>
For example, <code class="literal">scan.chr</code> contains the following
uppercase equivalent:
</p><pre class="screen">
uppercase {0-9}{A-Y}ÜZÆÄØÖÅ
</pre><p>
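The entry-by-entry correspondence matters most when a value set contains parenthesized groups. The pair below is a hypothetical sketch for traditional Spanish ordering, not taken from the <span class="application">Zebra</span> distribution and untested: each group in the lowercase list has its counterpart at the same position in the uppercase list. Mixed-case forms such as "Ch" are not covered by this sketch.
</p><pre class="screen">
lowercase {0-9}{a-c}(ch){d-l}(ll){m-n}ñ{o-z}
uppercase {0-9}{A-C}(CH){D-L}(LL){M-N}Ñ{O-Z}
</pre><p>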
</p></dd><dt><span class="term">space <em class="replaceable"><code>value-set</code></em></span></dt><dd><p>
This directive introduces the characters
that separate words in the input stream. Depending on the
completeness mode of the field in question, these characters either
terminate an index entry, or delimit individual "words" in
the input stream. The order of the elements is not significant;
otherwise the representation is the same as for the
<code class="literal">uppercase</code> and <code class="literal">lowercase</code>
directives.
</p><p>
For example, <code class="literal">scan.chr</code> contains the following
space instruction:
</p><pre class="screen">
space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
</pre><p>
</p></dd><dt><span class="term">map <em class="replaceable"><code>value-set</code></em>
<em class="replaceable"><code>target</code></em></span></dt><dd><p>
This directive introduces a mapping from each of the
members of the value-set on the left to the character on the
right. The character on the right must occur in the value
set (the <code class="literal">lowercase</code> directive) of the
character set, but it may be a parenthesis-enclosed
multi-octet character. This directive may be used to map
diacritics to their base characters, or to map HTML-style
character-representations to their natural form, etc. The
map directive can also be used to ignore leading articles in
searching and/or sorting, and to perform other special
transformations.
</p><p>
For example, <code class="literal">scan.chr</code> contains the following
map instructions among others, to make sure that HTML entity
encoded Danish special characters are mapped to the
equivalent Latin-1 characters:
</p><pre class="screen">
map (&aelig;) æ
map (&oslash;) ø
map (&aring;) å
</pre><p>
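In the same spirit, a value set on the left can fold several accented characters onto one base character. The lines below are an illustrative sketch, assuming an ISO-8859-1 encoded file whose <code class="literal">lowercase</code> directive contains the targets e and a. They are not quoted from any shipped map and are untested.
</p><pre class="screen">
map éèêë e
map áàâ a
</pre><p>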
</p><p>
In addition to specifying sort orders, space (blank) handling,
and upper/lowercase folding, you can also use the character map
files to make <span class="application">Zebra</span> ignore leading articles in sorting records,
or when doing complete field searching.
</p><p>
This is done using the <code class="literal">map</code> directive in the
character map file. In a nutshell, you map certain
sequences of characters, when they occur <span class="emphasis"><em>at the
beginning of a field</em></span>, to a space. Assuming that the
character "@" is defined as a space character in your file, you
can do:
</p><pre class="screen">
map (^The\s) @
map (^the\s) @
</pre><p>
The effect of these directives is to map either 'the' or 'The',
followed by a space character, to a space. The caret (^)
denotes beginning-of-field only when complete-subfield indexing
or sort indexing is taking place; otherwise, it is treated just
as any other character.
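</p><p>
Other leading articles can be handled the same way. The following sketch extends the example to the English indefinite articles, again assuming that "@" is defined as a space character in the file. It is illustrative and untested.
</p><pre class="screen">
map (^A\s) @
map (^a\s) @
map (^An\s) @
map (^an\s) @
</pre><p>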
</p><p>
Because the <code class="literal">default.idx</code> file can be used to
associate different character maps with different indexing types
-- and you can create additional indexing types, should the need
arise -- it is possible to specify that leading articles should
be ignored either in sorting, in complete-field searching, or
both.
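</p><p>
A hedged sketch of such a setup in <code class="literal">default.idx</code> might look as follows. It assumes that file's index, sort, completeness, and charmap directives, and the file name noarticles.chr is hypothetical: a copy of string.chr with the article mappings added. Here the sort register and the complete-field index p use the article-stripping map, while the word index w keeps the stock map.
</p><pre class="screen">
index w
completeness 0
charmap string.chr

index p
completeness 1
charmap noarticles.chr

sort s
completeness 1
charmap noarticles.chr
</pre><p>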
</p><p>
If you ignore certain prefixes in sorting, then these will be
eliminated from the index, and sorting will take place as if
they weren't there. However, if you set the system up to ignore
certain prefixes in <span class="emphasis"><em>searching</em></span>, then these
are deleted both from the indexes and from query terms, when the
client specifies complete-field searching. As a result,
searches for 'the science journal' and 'science journal'
produce the same results.
</p></dd><dt><span class="term">equivalent <em class="replaceable"><code>value-set</code></em></span></dt><dd><p>
This directive introduces equivalence classes of characters
and/or strings for sorting purposes only. It resembles the map
directive, but it does not affect search and retrieval indexing;
it affects only the sort order used for present requests.
</p><p>
For example, <code class="literal">scan.chr</code> contains the following
equivalent sorting instructions, which are commented out and can be enabled by removing the leading <code class="literal">#</code>:
</p><pre class="screen">
# equivalent æä(ae)
# equivalent øö(oe)
# equivalent å(aa)
# equivalent uü
</pre><p>
</p></dd></dl></div><p>
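Putting the directives together, a small character map might look like the sketch below. It reuses lines quoted in this section. The encoding line, the comment, and the choice of which lines to combine are assumptions, and the result has not been tested against <span class="application">Zebra</span>. The encoding directive comes first, as required.
</p><pre class="screen">
encoding iso-8859-1
# hypothetical local map assembled from the examples in this section
lowercase {0-9}{a-y}üzæäøöå
uppercase {0-9}{A-Y}ÜZÆÄØÖÅ
space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
map (^The\s) @
map (^the\s) @
equivalent æä(ae)
equivalent øö(oe)
</pre><p>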
</p></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="fields-and-charsets.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="fields-and-charsets.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="icuchain-files.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 10. Field Structure and Character Sets
</td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> 3. ICU Chain Files</td></tr></table></div></body></html>