/usr/share/doc/swi-prolog-doc/Manual/widechars.html is in swi-prolog-doc 5.6.59-1.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<HTML>
<HEAD>
<TITLE>SWI-Prolog 5.6.59 Reference Manual: Section 2.17</TITLE><LINK REL=home HREF="index.html">
<LINK REL=contents HREF="Contents.html">
<LINK REL=index HREF="DocIndex.html">
<LINK REL=previous HREF="cyclic.html">
<LINK REL=next HREF="limits.html">
<STYLE type="text/css">
/* Style sheet for SWI-Prolog latex2html
*/
dd.defbody
{ margin-bottom: 1em;
}
dt.pubdef
{ background-color: #c5e1ff;
}
pre.code
{ margin-left: 1.5em;
margin-right: 1.5em;
border: 1px dotted;
padding-top: 5px;
padding-left: 5px;
padding-bottom: 5px;
background-color: #f8f8f8;
}
div.navigate
{ text-align: center;
background-color: #f0f0f0;
border: 1px dotted;
padding: 5px;
}
div.title
{ text-align: center;
padding-bottom: 1em;
font-size: 200%;
font-weight: bold;
}
div.author
{ text-align: center;
font-style: italic;
}
div.abstract
{ margin-top: 2em;
background-color: #f0f0f0;
border: 1px dotted;
padding: 5px;
margin-left: 10%; margin-right:10%;
}
div.abstract-title
{ text-align: center;
padding: 5px;
font-size: 120%;
font-weight: bold;
}
div.toc-h1
{ font-size: 200%;
font-weight: bold;
}
div.toc-h2
{ font-size: 120%;
font-weight: bold;
margin-left: 2em;
}
div.toc-h3
{ font-size: 100%;
font-weight: bold;
margin-left: 4em;
}
div.toc-h4
{ font-size: 100%;
margin-left: 6em;
}
span.sec-nr
{
}
span.sec-title
{
}
span.pred-ext
{ font-weight: bold;
}
span.pred-tag
{ float: right;
font-size: 80%;
font-style: italic;
color: #202020;
}
/* Footnotes */
sup.fn { color: blue; text-decoration: underline; }
span.fn-text { display: none; }
sup.fn span {display: none;}
sup:hover span
{ display: block !important;
position: absolute; top: auto; left: auto; width: 80%;
color: #000; background: white;
border: 2px solid;
padding: 5px; margin: 10px; z-index: 100;
font-size: smaller;
}
</STYLE>
</HEAD>
<BODY BGCOLOR="white">
<DIV class="navigate"><A class="nav" href="index.html"><IMG SRC="home.gif" BORDER=0 ALT="Home"></A>
<A class="nav" href="Contents.html"><IMG SRC="index.gif" BORDER=0 ALT="Contents"></A>
<A class="nav" href="DocIndex.html"><IMG SRC="yellow_pages.gif" BORDER=0 ALT="Index"></A>
<A class="nav" href="cyclic.html"><IMG SRC="prev.gif" BORDER=0 ALT="Previous"></A>
<A class="nav" href="limits.html"><IMG SRC="next.gif" BORDER=0 ALT="Next"></A>
</DIV>
<H2><A NAME="sec:2.17"><SPAN class="sec-nr">2.17</SPAN> <SPAN class="sec-title">Wide
character support</SPAN></A></H2>
<A NAME="sec:widechars"></A>
<P><A NAME="idx:UTF8:239"></A><A NAME="idx:Unicode:240"></A><A NAME="idx:UCS:241"></A><A NAME="idx:internationalization:242"></A>SWI-Prolog
supports <EM>wide characters</EM>, characters with character codes above
255 that cannot be represented in a single <EM>byte</EM>.
<EM>Universal Character Set</EM> (UCS) is the ISO/IEC 10646 standard
that specifies a unique 31-bits unsigned integer for any character in
any language. It is a superset of 16-bit Unicode, which in turn is a
superset of ISO 8859-1 (ISO Latin-1), a superset of US-ASCII. UCS can
handle strings holding characters from multiple languages and character
classification (uppercase, lowercase, digit, etc.) and operations such
as case-conversion are unambiguously defined.
<P>For this reason SWI-Prolog has two representations for atoms and
string objects (see <A class="sec" href="strings.html">section 4.23</A>).
If the text fits in ISO Latin-1, it is represented as an array of 8-bit
characters. Otherwise the text is represented as an array of 32-bit
numbers. This representational issue is completely transparent to the
Prolog user. Users of the foreign language interface as described in <A class="sec" href="foreign.html">section
9</A> sometimes need to be aware of these issues though.
<P>Character coding comes into view when characters of strings need to
be read from or written to file or when they have to be communicated to
other software components using the foreign language interface. In this
section we only deal with I/O through streams, which includes file I/O
as well as I/O through network sockets.
<H3><A NAME="sec:2.17.1"><SPAN class="sec-nr">2.17.1</SPAN> <SPAN class="sec-title">Wide
character encodings on streams</SPAN></A></H3>
<A NAME="sec:encoding"></A>
<P>Although characters are uniquely coded using the UCS standard
internally, streams and files are byte (8-bit) oriented and there are a
variety of ways to represent the larger UCS codes in an 8-bit octet
stream. The most popular one, especially in the context of the web, is
UTF-8. Bytes 0 ... 127 represent simply the corresponding
US-ASCII character, while bytes 128 ... 255 are used for
multi-byte encoding of characters placed higher in the UCS space.
Especially on MS-Windows the 16-bit Unicode standard, represented by
pairs of bytes is also popular.
<P>Prolog I/O streams have a property called <EM>encoding</EM> which
specifies the used encoding that influence <A NAME="idx:getcode2:243"></A><A class="pred" href="chario.html#get_code/2">get_code/2</A>
and <A NAME="idx:putcode2:244"></A><A class="pred" href="chario.html#put_code/2">put_code/2</A>
as well as all the other text I/O predicates.
<P>The default encoding for files is derived from the Prolog flag
<A class="flag" href="flags.html#flag:encoding">encoding</A>, which is
initialised from the environment. If the environment variable <CODE>LANG</CODE>
ends in "UTF-8", this encoding is assumed. Otherwise the default is <CODE>text</CODE>
and the translation is left to the wide-character functions of the
C-library.
<SUP class="fn">11<SPAN class="fn-text">The Prolog native UTF-8 mode is
considerably faster than the generic mbrtowc() one.</SPAN></SUP> The
encoding can be specified explicitly in <A NAME="idx:loadfiles2:245"></A><A class="pred" href="consulting.html#load_files/2">load_files/2</A>
for loading Prolog source with an alternative encoding, <A NAME="idx:open4:246"></A><A class="pred" href="IO.html#open/4">open/4</A>
when opening files or using <A NAME="idx:setstream2:247"></A><A class="pred" href="IO.html#set_stream/2">set_stream/2</A>
on any open stream. For Prolog source files we also provide the <A NAME="idx:encoding1:248"></A><A class="pred" href="consulting.html#encoding/1">encoding/1</A>
directive that can be used to switch between encodings that are
compatible to US-ASCII (<CODE>ascii</CODE>,
<CODE>iso_latin_1</CODE>, <CODE>utf8</CODE> and many locales). See also
<A class="sec" href="projectfiles.html">section 3.1.3</A> for writing
Prolog files with non-US-ASCII characters and <A class="sec" href="syntax.html">section
2.15.1.4</A> for syntax issues. For additional information and Unicode
resources, please visit
<A class="url" href="http://www.unicode.org/">http://www.unicode.org/</A>.
<P>SWI-Prolog currently defines and supports the following encodings:
<DL>
<DT><STRONG>octet</STRONG></DT>
<DD class="defbody">
Default encoding for <CODE>binary</CODE> streams. This causes the stream
to be read and written fully untranslated.</DD>
<DT><STRONG>ascii</STRONG></DT>
<DD class="defbody">
7-bit encoding in 8-bit bytes. Equivalent to <CODE>iso_latin_1</CODE>,
but generates errors and warnings on encountering values above 127.</DD>
<DT><STRONG>iso_latin_1</STRONG></DT>
<DD class="defbody">
8-bit encoding supporting many western languages. This causes the stream
to be read and written fully untranslated.</DD>
<DT><STRONG>text</STRONG></DT>
<DD class="defbody">
C-library default locale encoding for text files. Files are read and
written using the C-library functions mbrtowc() and wcrtomb(). This may
be the same as one of the other locales, notably it may be the same as <CODE>iso_latin_1</CODE>
for western languages and <CODE>utf8</CODE> in a UTF-8 context.</DD>
<DT><STRONG>utf8</STRONG></DT>
<DD class="defbody">
Multi-byte encoding of full UCS, compatible to <CODE>ascii</CODE>. See
above.</DD>
<DT><STRONG>unicode_be</STRONG></DT>
<DD class="defbody">
Unicode <EM>Big Endian</EM>. Reads input in pairs of bytes, most
significant byte first. Can only represent 16-bit characters.</DD>
<DT><STRONG>unicode_le</STRONG></DT>
<DD class="defbody">
Unicode <EM>Little Endian</EM>. Reads input in pairs of bytes, least
significant byte first. Can only represent 16-bit characters.
</DD>
</DL>
<P>Note that not all encodings can represent all characters. This
implies that writing text to a stream may cause errors because the
stream cannot represent these characters. The behaviour of a stream on
these errors can be controlled using <A NAME="idx:setstream2:249"></A><A class="pred" href="IO.html#set_stream/2">set_stream/2</A>.
Initially the terminal stream write the characters using Prolog escape
sequences while other streams generate an I/O exception.
<H4><A NAME="sec:2.17.1.1"><SPAN class="sec-nr">2.17.1.1</SPAN> <SPAN class="sec-title">BOM:
Byte Order Mark</SPAN></A></H4>
<A NAME="sec:bom"></A>
<P><A NAME="idx:BOM:250"></A><A NAME="idx:ByteOrderMark:251"></A>From <A class="sec" href="widechars.html">section
2.17.1</A>, you may have got the impression text-files are complicated.
This section deals with a related topic, making live often easier for
the user, but providing another worry to the programmer.
<B>BOM</B> or <EM>Byte Order Marker</EM> is a technique for identifying
Unicode text-files as well as the encoding they use. Such files start
with the Unicode character 0xFEFF, a non-breaking, zero-width space
character. This is a pretty unique sequence that is not likely to be the
start of a non-Unicode file and uniquely distinguishes the various
Unicode file formats. As it is a zero-width blank, it even doesn't
produce any output. This solves all problems, or ... Some formats start
of as US-ASCII and may contain some encoding mark to switch to UTF-8,
such as the <CODE>encoding="UTF-8"</CODE> in an XML header. Such formats
often explicitly forbid the use of a UTF-8 BOM. In other cases there is
additional information telling the encoding making the use of a BOM
redundant or even illegal.
<P>The BOM is handled by SWI-Prolog <A NAME="idx:open4:252"></A><A class="pred" href="IO.html#open/4">open/4</A>
predicate. By default, text-files are probed for the BOM when opened for
reading. If a BOM is found, the encoding is set accordingly and the
property <CODE>bom(true)</CODE> is available through <A NAME="idx:streamproperty2:253"></A><A class="pred" href="IO.html#stream_property/2">stream_property/2</A>.
When opening a file for writing, writing a BOM can be requested using
the option <CODE>bom(true)</CODE> with
<A NAME="idx:open4:254"></A><A class="pred" href="IO.html#open/4">open/4</A>.
<P></BODY></HTML>
|