
<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>2. Charmap Files</title><meta name="generator" content="DocBook XSL Stylesheets V1.78.1"><link rel="home" href="index.html" title="Zebra - User's Guide and Reference"><link rel="up" href="fields-and-charsets.html" title="Chapter 10. Field Structure and Character Sets"><link rel="prev" href="fields-and-charsets.html" title="Chapter 10. Field Structure and Character Sets"><link rel="next" href="icuchain-files.html" title="3. ICU Chain Files"></head><body><link rel="stylesheet" type="text/css" href="common/style1.css"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">2. Charmap Files</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="fields-and-charsets.html">Prev</a> </td><th width="60%" align="center">Chapter 10. Field Structure and Character Sets
  </th><td width="20%" align="right"> <a accesskey="n" href="icuchain-files.html">Next</a></td></tr></table><hr></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="character-map-files"></a>2. Charmap Files</h2></div></div></div><p>
    The character map files define the word tokenization
    and character normalization performed before text is inserted into
    the inverted indexes. <span class="application">Zebra</span> ships with the predefined character map
    files <code class="filename">tab/*.chr</code>. Users may add
    or modify maps according to their needs.
   </p><div class="table"><a name="character-map-table"></a><p class="title"><b>Table 10.1. Character maps predefined in <span class="application">Zebra</span></b></p><div class="table-contents"><table summary="Character maps predefined in Zebra" border="1"><colgroup><col><col><col></colgroup><thead><tr><th>File name</th><th>Intended type</th><th>Description</th></tr></thead><tbody><tr><td><code class="literal">numeric.chr</code></td><td><code class="literal">:n</code></td><td>Numeric digit tokenization and normalization map. All
         characters not in the set <code class="literal">-{0-9}.,</code> will be
         suppressed. Note that floating point numbers are handled
         correctly, but numbers in scientific (exponential) notation are mangled.</td></tr><tr><td><code class="literal">scan.chr</code></td><td><code class="literal">:w or :p</code></td><td>Word tokenization char map for Scandinavian
         languages. This map resembles the generic word tokenization
         character map <code class="literal">tab/string.chr</code>; the main
         differences are the sort order of the special characters
        <code class="literal">üzæäøöå</code> and the equivalence maps, which follow
         Scandinavian language rules.</td></tr><tr><td><code class="literal">string.chr</code></td><td><code class="literal">:w or :p</code></td><td>General word tokenization and normalization character
         map, mostly useful for English texts. Use this to derive your
         own language tokenization and normalization derivatives.</td></tr><tr><td><code class="literal">urx.chr</code></td><td><code class="literal">:u</code></td><td>URL parsing and tokenization character map.</td></tr><tr><td><code class="literal">@</code></td><td><code class="literal">:0</code></td><td>Do-nothing character map used for literal binary
         indexing. No file is associated with it, and
         no normalization or tokenization is performed at all.</td></tr></tbody></table></div></div><br class="table-break"><p>
    The contents of the character map files are structured as follows:
    </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">encoding <em class="replaceable"><code>encoding-name</code></em></span></dt><dd><p>
	This directive must be at the very beginning of the file, and it
        specifies the character encoding used in the entire file. If
        omitted, the encoding <code class="literal">ISO-8859-1</code> is assumed.
       </p><p>
        For example, one of the test files found at  
          <code class="literal">test/rusmarc/tab/string.chr</code> contains the following
        encoding directive:
        </p><pre class="screen">
         encoding koi8-r
        </pre><p>
          and the test file
          <code class="literal">test/charmap/string.utf8.chr</code> is encoded
          in UTF-8:
        </p><pre class="screen">
         encoding utf-8
        </pre><p>
       </p></dd><dt><span class="term">lowercase <em class="replaceable"><code>value-set</code></em></span></dt><dd><p>
	This directive introduces the basic value set of the field type.
	The format is an ordered list (without spaces) of the
	characters which may occur in "words" of the given type.
	The order of the entries in the list determines the
	sort order of the index. In addition to single characters, the
	following combinations are legal:
       </p><p>

	</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	   Backslashes may be used to introduce three-digit octal or
	   two-digit hexadecimal representations of single characters
	   (the latter preceded by <code class="literal">x</code>).
	   In addition, the combinations
	   \\, \r, \n, \t, and \s (space; remember that real
	   space characters may not occur in the value definition)
	   are recognized, with their usual interpretation.
	  </p></li><li class="listitem"><p>
	   Curly braces {} may be used to enclose ranges of single
	   characters (possibly using the escape convention described in the
	   preceding point), e.g., {a-z} to introduce the
	   range of lowercase ASCII letters.
	   Note that the interpretation of such a range depends on
	   the concrete representation in your local, physical character set.
	  </p></li><li class="listitem"><p>
	   Parentheses () may be used to enclose multi-byte characters,
	   e.g., diacritics or special national combinations (such as Spanish
	   "ll"). When found in the input stream (or a search term),
	   these characters are viewed and sorted as a single character, with a
	   sorting value depending on the position of the group in the value
	   statement.
	  </p></li></ul></div><p>

       </p><p>
        For example, <code class="literal">scan.chr</code> contains the following
        lowercase normalization and sorting order:
        </p><pre class="screen">
         lowercase {0-9}{a-y}üzæäøöå
        </pre><p>
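        As a further illustration of the notation described above, here is a
        hypothetical value set (not taken from any file shipped with
        <span class="application">Zebra</span>, and untested). It assumes the rules stated above:
        the digit range is written with octal escapes, and the Spanish
        digraphs "ch" and "ll" are enclosed in parentheses so that they
        sort as single characters after "c" and "l" respectively:
        </p><pre class="screen">
         lowercase {\060-\071}{a-c}(ch){d-l}(ll){m-n}ñ{o-z}
        </pre><p>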
       </p></dd><dt><span class="term">uppercase <em class="replaceable"><code>value-set</code></em></span></dt><dd><p>
	This directive introduces the
	upper-case equivalences to the value set (if any). The number and
	order of the entries in the list should be the same as in the
	<code class="literal">lowercase</code> directive.
       </p><p>
        For example, <code class="literal">scan.chr</code> contains the following
        uppercase equivalent:
        </p><pre class="screen">
         uppercase {0-9}{A-Y}ÜZÆÄØÖÅ
        </pre><p>
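        A minimal sketch of a matching pair for plain ASCII text (an
        illustrative assumption, not a copy of a shipped file): both lists
        contain 36 entries, and each upper-case entry sits at the same
        position as its lower-case equivalent:
        </p><pre class="screen">
         lowercase {0-9}{a-z}
         uppercase {0-9}{A-Z}
        </pre><p>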
       </p></dd><dt><span class="term">space <em class="replaceable"><code>value-set</code></em></span></dt><dd><p>
	This directive introduces the characters
	which separate words in the input stream. Depending on the
	completeness mode of the field in question, these characters either
	terminate an index entry, or delimit individual "words" in
	the input stream. The order of the elements is not significant &#8212;
	otherwise the representation is the same as for the
	<code class="literal">uppercase</code> and <code class="literal">lowercase</code>
	directives.
       </p><p>
        For example, <code class="literal">scan.chr</code> contains the following
        space instruction:
        </p><pre class="screen">
         space {\001-\040}!"#$%&amp;'\()*+,-./:;&lt;=&gt;?@\[\\]^_`\{|}~
        </pre><p>
       </p></dd><dt><span class="term">map <em class="replaceable"><code>value-set</code></em>
       <em class="replaceable"><code>target</code></em></span></dt><dd><p>
	This directive introduces a mapping between each of the
	members of the value-set on the left to the character on the
	right. The character on the right must occur in the value
	set (the <code class="literal">lowercase</code> directive) of the
	character set, but it may be a parenthesis-enclosed
	multi-octet character. This directive may be used to map
	diacritics to their base characters, or to map HTML-style
	character-representations to their natural form, etc. The
	map directive can also be used to ignore leading articles in
	searching and/or sorting, and to perform other special
	transformations.
       </p><p>
        For example, <code class="literal">scan.chr</code> contains the following
        map instructions, among others, to make sure that HTML-entity-encoded
        Danish special characters are mapped to the
        equivalent Latin-1 characters:
        </p><pre class="screen">
         map (&amp;aelig;)      æ
         map (&amp;oslash;)     ø
         map (&amp;aring;)      å
        </pre><p>
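        In the same way, diacritics can be folded onto their base characters.
        The following lines are a sketch, assuming that
        <code class="literal">e</code> and <code class="literal">a</code> are part of the
        <code class="literal">lowercase</code> value set of the file; the choice of
        characters is illustrative only. Each member of the value-set on the
        left is mapped to the single character on the right:
        </p><pre class="screen">
         map éèêëÉÈÊË       e
         map áàâÁÀÂ         a
         map (&amp;eacute;)      e
        </pre><p>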
	</p><p>
	In addition to specifying sort orders, space (blank) handling,
	and upper/lowercase folding, you can also use the character map
	files to make <span class="application">Zebra</span> ignore leading articles in sorting records,
	or when doing complete field searching.
       </p><p>
	This is done using the <code class="literal">map</code> directive in the
	character map file. In a nutshell, you map certain
	sequences of characters, when they occur <span class="emphasis"><em>at the
	 beginning of a field</em></span>, to a space. Assuming that the
	character "@" is defined as a space character in your file, you
	can do:
	</p><pre class="screen">
	 map (^The\s) @
	 map (^the\s) @
	</pre><p>
	The effect of these directives is to map either 'the' or 'The',
	followed by a space character, to a space. The hat ^ character
	denotes beginning-of-field only when complete-subfield indexing
	or sort indexing is taking place; otherwise, it is treated just
	as any other character.
       </p><p>
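	The same pattern can be extended to other leading articles. The
	sketch below assumes, as before, that "@" is listed in the
	<code class="literal">space</code> directive of the file; the selection of
	articles is illustrative only:
	</p><pre class="screen">
	 map (^A\s)   @
	 map (^a\s)   @
	 map (^An\s)  @
	 map (^an\s)  @
	</pre><p>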
	Because the <code class="literal">default.idx</code> file can be used to
	associate different character maps with different indexing types
	-- and you can create additional indexing types, should the need
	arise -- it is possible to specify that leading articles should
	be ignored either in sorting, in complete-field searching, or
	both.
       </p><p>
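	As a sketch of such a setup, the fragment below assumes the usual
	<code class="literal">default.idx</code> layout and a hypothetical file
	<code class="literal">noarticle.chr</code>, that is, a copy of
	<code class="literal">string.chr</code> extended with the article
	<code class="literal">map</code> directives. Only the sort register refers
	to it, so leading articles would be ignored in sorting but not in
	complete-field searching:
	</p><pre class="screen">
	 # complete-field (phrase) index: leading articles are kept
	 index p
	 completeness 1
	 charmap string.chr

	 # sort register: leading articles are ignored
	 sort s
	 completeness 1
	 charmap noarticle.chr
	</pre><p>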
	If you ignore certain prefixes in sorting, then these will be
	eliminated from the index, and sorting will take place as if
	they weren't there. However, if you set the system up to ignore
	certain prefixes in <span class="emphasis"><em>searching</em></span>, then these
	are deleted both from the indexes and from query terms, when the
	client specifies complete-field searching. As a result,
	searches for 'the science journal' and 'science journal'
	produce the same results.
       </p></dd><dt><span class="term">equivalent <em class="replaceable"><code>value-set</code></em></span></dt><dd><p>
	This directive introduces equivalence classes of characters
	and/or strings for sorting purposes only. It resembles the map
	directive, but it does not affect search and retrieval indexing;
	it affects only the sort order under present requests.
       </p><p>
        For example, <code class="literal">scan.chr</code> contains the following
        equivalent sorting instructions, which can be uncommented:
        </p><pre class="screen">
         # equivalent æä(ae)
         # equivalent øö(oe)
         # equivalent å(aa)
         # equivalent uü
        </pre><p>
       </p></dd></dl></div><p>
   </p></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="fields-and-charsets.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="fields-and-charsets.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="icuchain-files.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 10. Field Structure and Character Sets
   </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> 3. ICU Chain Files</td></tr></table></div></body></html>