/usr/share/doc/flexml/html/paper.html is in flexml 1.9.6-1.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 | <?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>
<head>
<title>Generating Fast Validating XML Processors</title>
<!--$Id: paper.html,v 1.13 2006/07/18 18:21:13 mquinson Exp $-->
<style type="text/css">
<!--
BODY { BACKGROUND-COLOR: #ffffff; FONT-FAMILY: arial, times new roman, sans-serif; }
A:link, A:visited, A:active { TEXT-DECORATION: none; FONT-WEIGHT: bold; COLOR: #0000FF}
H1, H2 { TEXT-ALIGN: center; FONT-WEIGHT: bold; }
H3, H4, H5 { TEXT-ALIGN: left; FONT-WEIGHT: bold; }
H6 { TEXT-ALIGN: center; FONT-WEIGHT: bold; FONT-SIZE: small; }
H6.CAPTION { TEXT-ALIGN: center; FONT-WEIGHT: bold; FONT-SIZE: small; FONT-STYLE: italic; }
P { TEXT-INDENT: 1em; }
P.CODE { TEXT-INDENT: 0; COLOR: #FF0000; }
UL, OL, DL { FONT-SIZE: small; }
UL { list-style: square; }
LI { FONT-SIZE: small; FONT-WEIGHT: bold; }
UL EM, OL EM { FONT-WEIGHT: bold; }
CODE, CITE { FONT-WEIGHT: bold; }
BLOCKQUOTE { MARGIN-LEFT: 1em; MARGIN-RIGHT: 1em }
IMG { VERTICAL-ALIGN: top; ALIGN: center; }
SUP { COLOR: #0000FF; FONT-SIZE: small; }
-->
</style>
</head>
<body>
<h2>
Generating Fast Validating XML Processors
</h2>
<h6>
(Extended Abstract)
</h6>
<h6>
Kristoffer Rose<br/>
LIP, ENS-Lyon<br/>
<a href="mailto:krisrose@debian.org">krisrose@debian.org</a>
</h6>
<h6> Abstract </h6>
<p>We present <em>FleXML</em>, a program that generates fast
validating XML processors from `self-contained' XML DTDs. It uses
the <em>flex</em> (lexical analyser generator) program to
translate the DTD into a <em>finite automaton</em> enriched with a
stack with the `element context'. This means that the XML
processor will act directly on each character received. The
program is freely redistributable and modifyable (under GNU
`copyleft').</p>
<h6> Keywords </h6>
<p>Validating XML, DTD, lexical analysis, finite automata.</p>
<h4> Overview </h4>
<p>The `X' of XML stands for <em>Extensible</em> [<cite><a
href="#XML">XML</a></cite>]. This signifies that each and every
XML document specifies in its header the details of the format
that it will use and <em>may</em> change its format a bit relative
to the used base format.</p>
<p>
This has influenced the tools available to write validating XML
processors: they use a <em>call-back</em> model where the XML
processor passes strings with the tags and attributes names and
values to the application. These call-backs must be generic
because one cannot know whether a document will start by
extending its own notation with more tags and attributes. For
<em>well-formed</em> but non-validated XML documents this makes
a lot of sense, of course, but we would in general like to
exploit the information in the DTD for optimizations.</p>
<p>
In particular, for many applications a <em>fixed</em> format
suffices in the sense that a single DTD is used without
individual extensions for a large number of documents. In that
case we can do much better because the possible tags and
attributes are static.</p>
<p>
We have implemented an XML processor <em>generator</em> using
the <cite><a href="#Flex">Flex</a></cite> scanner generator that
produces deterministic finite automata [<cite><a
href="#ASU">ASU</a></cite>]. Which means that there is almost
no overhead for XML processing: the generated XML processors
read the XML document character by character and can immediately
dispatch the actions associated with each element (or reject the
document as invalid).</p>
<p>
Furthermore we have devised a simple extension of the C
programming language that facilitates the writing of `element
actions' in C, making it easy to write really fast XML
applications. In particular we represent XML attribute values
efficiently in C when this is possible, thus avoiding the
otherwise ubiquitous conversions between strings and data
values.</p>
<p>
FleXML is available for free
(from <a href="http://flexml.sourceforge.net">SourceForge</a>).
In this paper we present FleXML through an
elaborated <a href="#what">example</a> and discuss some of the
<a href="#how">technical issues</a>.</p>
<h4><a name="what">What can it do?</a></h4>
<p>Assume that we have an XML document <code>my-joke.xml</code>
containing the classical joke
<blockquote><code><pre><!DOCTYPE joke SYSTEM "my.dtd">
<joke><line>My appartment is so small</line> <suspense/>
<line type='punch-line'>the mice are round-shouldered</line></joke>
</pre></code></blockquote>
(and many others like it, of course). We wish to create an XML
processor to validate it with respect to the DTD in the file
<code>my.dtd</code> containing
<blockquote><code><pre><!-- my.dtd: Small DTD for jokes (just for fun). -->
<!ELEMENT joke (line,(line|suspense)*)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line type (normal|question|punch-line) 'normal'>
<!ELEMENT suspense EMPTY>
</pre></code></blockquote>
and, furthermore, we wish to write an XML application for
displaying such messages in an amusing way.</p>
<p>
With FleXML this can be done by creating an `actions file'
<code>my-show.act</code> which implements the desired actions
for each element. The remainder of this section explains the
contents of such an actions file.</p>
<p>
An actions file is itself an XML document which must begin with
<blockquote><code><pre><!DOCTYPE actions SYSTEM "flexml-act.dtd">
<actions>
</pre></code></blockquote>
(the <code>flexml-act.dtd</code> DTD is part of the FleXML
system and is reproduced in the manual page.</p>
<p>
We decide that our application should react to a
<code>line</code> element by printing the text inside it, and
that it should differentiate between the three possible `type'
attribute values by printing corresponding trailing punctuation.</p>
<p>
This introduces a slight complication, because the attribute
values are available when parsing the start tag whereas the
element text is not available until we parse the end tag (where
it has been read).</p>
<p>
This means that we must declare a top-level variable.
<blockquote><code><pre><top><![CDATA[
char* terminator = ".";
]]></top>
</pre></code></blockquote>
Notice how we use <code>CDATA</code> sections to make sure that
all characters (including white-space) are passed unmodified to
the C compiler.</p>
<p>
With this we can write the action to set it when reading the
<code>line</code> start tag as
<blockquote><code><pre><start tag='line'><![CDATA[
switch ( {type} ) {
case {!type}: terminator = "..."; break;
case {type=normal}: terminator = "."; break;
case {type=question}: terminator = "??"; break;
case {type=punch-line}: terminator = "!!"; break;
}
]]></start>
</pre></code></blockquote></p>
<p>
The idea is that the enumeration attribute <code>type</code> is
implemented in C as if it had been declared by
<blockquote><code><pre>enum { {!type}, {type=normal}, {type=question}, {type=punch-line} } {type};
</pre></code></blockquote>
(understanding the <code>{...}</code> units as C identifiers),
hence the possibility of using the fast C <code>switch</code>
statement to pick the right choice directly. Note that the
first choice, <code>{!<em>type</em>}</code>, corresponds to not
setting the attribute; in this example the attribute has a
default value so this can never happen, however, we include the
choice anyway to prevent the C compiler from issuing warnings
about missing choices in <code>switch</code> statements.</p>
<p>
With this in place we can write the action for
<code></line></code>. Since it prints something, however,
we first need to add the appropriate header inclusion.
<blockquote><code><pre><top><![CDATA[
#include <stdio.h>
]]></top>
<end tag='line'><![CDATA[
printf("%s%s\n", pcdata, terminator);
]]></end>
</pre></code></blockquote></p>
<p>
Finally, we will make the application amusing by `displaying'
the <code><suspense/></code> empty tag as a short delay;
this also involves a header inclusion.
<blockquote><code><pre><top><![CDATA[
#include <unistd.h>
]]></top>
<start tag='suspense'><![CDATA[
sleep(2);
]]></start>
</pre></code></blockquote></p>
<p>
That is all; the only remaining thing is to terminate the action
file properly.
<blockquote><code><pre></actions>
</pre></code></blockquote>
<p>
We can now run FleXML with the DTD and the actions file as input
and will get an XML application as output that, when run (after
processing by flex and a C compiler), will indeed print
<blockquote><code><pre>My appartment is so small.
the mice are round-shouldered!!
</pre></code></blockquote>
as expected, pausing duly for two seconds between the lines. On
the authors system the above output was achieved with the
command sequence
<blockquote><code><pre>flexml -A -a my-show.act my.dtd
flex -omy-show.c my-show.l
cc -omy-show my-show.c
./my-show <./my-joke.xml
</pre></code></blockquote>
(see the manual page for the exact meaning of the FleXML options).
<p>
An important aspect of the design of FleXML is that the only
thing that should matter to the programmer should be the
complexity of the <em>application</em>, not of the used DTD. As
an example the following action file prints the
<code>href</code> attribute of all hyperlinks in an XHTML
document:
<blockquote><code><pre><!DOCTYPE actions SYSTEM "flexml-act.dtd">
<actions>
<top><![CDATA[
#include <stdio.h>
]]></top>
<start tag='a'><![CDATA[
if ({href}) printf("%s\n", {href});
]]></start>
</actions>
</pre></code></blockquote>
which was compiled into a running application on the author's
system with the commands
<blockquote><code><pre>flexml $(FLEXDEBUG) -rhtml -p'-//IETF//DTD XHTML 1.0 Transitional//EN' \
'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
gcc -Wall -ansi -pedantic -O2 -g -c xhtml-href.c -o xhtml-href.o
flex -Bsv -Ca -oxhtml1-transitional.c xhtml1-transitional.l
gcc -Wall -ansi -pedantic -O2 -g -c xhtml1-transitional.c -o xhtml1-transitional.o
gcc -Wall -ansi -pedantic xhtml-href.o xhtml1-transitional.o -o xhtml-href
</pre></code></blockquote>
generating the XML processor directly from the official DTD on
the web (which in fact required a patch to flex to enlarge the
possible table size).
<h4><a name="how">How does it work?</a></h4>
<p>
FleXML is a perl script [<cite><a href="#Perl">Perl</a></cite>]
which reads and interprets a DTD and subsequently produces an
<em>XML processor</em> source file for the <em>flex</em> scanner
and optionally an <em>XML application</em> with the C source of
the element actions from an actions file. The DTD is used to
construct the rules used by flex to match the individual XML
components in such a way that only valid documents match.
<p>
For example, the flex code for scanning an attribute declaration
of the <code>line</code> tag is the following:
<blockquote><code><pre><AL_line>{
"type"{Eq}"'normal'" |
"type"{Eq}"\"normal\"" A_line_type = A_line_type_normal;
"type"{Eq}"'question'" |
"type"{Eq}"\"question\"" A_line_type = A_line_type_question;
"type"{Eq}"'punch-line'" |
"type"{Eq}"\"punch-line\"" A_line_type = A_line_type_punch_d_line;
">" {
LEAVE; STag_line(); pcdata = BUFFERSET(pcdata); ENTER(IN_line);
}
"/>" {
LEAVE; STag_line(); pcdata = ""; ETag_line();
switch (YY_START) {
case ROOT_line: SET(EPILOG); break;
case S_joke: SET(S_joke_1); break;
case S_joke_1: case S_joke_2: case S_joke_3: SET(S_joke_3); break;
}}}
</pre></code></blockquote>
(with <code>{Eq}</code> an alias for the regular expression
matching an equal sign (corresponding to production `[25] Eq' of
the XML specification).
<p>
This reads as follows: when the XML processor is reading the
attribute list of the <code>line</code> tag, i.e., when it is in
the <code><AL_line></code> state, a `<code>t</code>' will
enter an internal state where a `<code>y</code>' proceeds to
another internal state but other characters makes the document
invalid (because no rule permits it). Once the equal sign has
been scanned, the next characters determine the attribute value,
and at the end one of the flex actions is performed, setting the
attribute value (<code>A_line_type</code> is the real C for what
we wrote as <code>{type}</code>, etc.). The important thing is
that one can ensure by careful tuning of the flex rules that a
valid document will proceed only by looking each character up in
a table and determining the subsequent `state' and `action'.
One must avoid pairs of rules such as
<blockquote><code><pre>"-->" LEAVE(COMMENT);
. SKIP;
</pre></code></blockquote>
(a single `<code>.</code>' matches any character) because they
mean that the scanner will not be sure after having read a
`<code>-</code>' character whether it is part of a comment
terminator or `just' a dash. In such cases an extra rule must be
inserted because for the set
<blockquote><code><pre>"-->" LEAVE(COMMENT);
"--" |
. SKIP;
</pre></code></blockquote>
the problem goes away.
<p>
After the actual attribute rules, two rules handle termination
of the attribute list. There are two cases corresponding to
whether we just read a start tag or an empty element. In case
it was a start tag then we must enter the `inner' mode of the
element called <code>IN_line</code> for the <code>line</code>
element. The <code>switch</code> handles the state changes
needed for the line element resulting from the fact that the
element can appear in different contexts. This is always
possible to construct because of the requirement that an XML DTD
must be <em>deterministic</em>: we just need an element content
stack (this is what the <code>LEAVE</code> and
<code>ENTER</code> macros are for).
<h4><a name="why">Why is it useful?</a></h4>
<p>
In comparison with the forthcoming XML Style-sheet Language
[<cite><a href="#XSL">XSL</a></cite>] our approach is much more
primitive for better and worse: only a limited range of
applications can be produced with FleXML but those will be very
fast.
<p>
This is useful for XML applications that are meant to process a
large number of documents in a fixed format. One such
application is the NTSys u-payment transaction server which is
implemented as an Apache module where performance is of premium
importance. Using FleXML permits a highly modular development
of modules for the various transaction types based on a common
DTD and a collection of applications that are generated
separately and all linked together with the common processor.
<p>
FleXML is still under development: please try it out (either
from <a "http://flexml.sourceforge.net">SourceForge</a>
or from the <a href="www.debian.org">Debian
GNU/Linux</a> distribution where FleXML is include from release
2.2. The author would welcome comments as to how the system can
best evolve. Two things that are definitely on the agenda is a
limited `context condition' language for expressing constraints
on the position of an element in a document (corresponding to
the Xpath subset of XSL), and an efficient way to combine
several DTDs into one large t facilitate general XML servers
(that can even dispatch to a generic XML interface in cases
where the FleXML restrictions are not respected by a document).
<h4> Acknowledgements </h4>
<p>
I am grateful to <a href="http://www.ntsys.fr">NTSys</a> for
supporting the development of FleXML. Finally extend my sincere
thanks to Jef Poskanzer, Vern Paxson, and the rest of the
<em>flex</em> maintainers for a great tool.
<h4> References </h4>
<dl compact>
<dt><a name="ASU"><cite>ASU</cite></a>
<dd>
Alfred Aho, Ravi Sethi and Jeffrey Ullman: <em>Compilers:
Principles, Techniques and Tools</em>, Addison-Wesley
(1986).<p>
<dt><a name="Flex"><cite>Flex</cite></a>
<dd>
Jef Poskanzer, Vern Paxson, <em>et. al.</em>: <em>Flex - fast
lexical analyzer generator</em>.<p>
<dt><a name="Perl"><cite>Perl</cite></a>
<dd>
Larry Wall, <em>Perl - Practical Extraction and Report
Language</em>.<p>
<dt><a name="XML"><cite>XML</cite></a>
<dd>
Extensible Markup Language (XML) 1.0 (W3C
Recommendation REC-xml-1998-0210).<p>
<dt><a name="XSL"><cite>XSL</cite></a>
<dd>
Extensible Stylesheet Language (XSL) (W3C Working Draft).<p>
</dl>
<hr>
<address>Copyright (c)
<a href="mailto:krisrose@debian.org">Kristoffer Rose</a>.
<!-- Created: Thu Dec 9 08:09:41 CET 1999 -->
<!-- hhmts start -->Last modified: Tue Feb 11 13:56:40 EST 2003 <!-- hhmts end -->
</address>
</body>
</html>
|