/usr/share/jed/doc/txt/dfa.txt

DFA-based Syntax Highlighting
=============================

DFA highlighting is an alternative syntax highlighting mechanism to
Jed's original simple one. It's a lot more powerful, but it takes up
more memory and makes the executable larger if it's compiled in.
It's also more difficult to design new highlighting modes for.

DFA highlighting works *alongside* Jed's standard highlighting system in
the sense that the user can choose which scheme is to be used on a
mode-by-mode basis.

Some examples of what DFA highlighting can do that the standard scheme
can't are:

- Correct separation of numeric tokens in C. The text `2+3' would
  get highlighted as a single number by the old scheme, since `+' is
  a valid numeric character (when preceded by an E). DFA
  highlighting can spot that the `+' is not a valid numeric
  character in _this_ instance, though, and correctly interpret it
  as an operator.

- Enhanced HTML mode, in which tags containing mismatched quotes
  (such as `<a href="filename>') can be highlighted in a different
  colour from correctly formed tags.

- Much improved Perl mode, in general.

- PostScript mode, in which up to two levels of nested parentheses
  can be detected inside a string constant.

Limitations
-----------

- Jed's DFA highlight rules work only on a line-by-line basis. Using
  the DFA scheme, it is impossible to highlight multiline comments or
  string literals.
  
- DFA rules replace the "traditional" highlighting scheme for a mode,
  so you cannot have both highlighting of multiline tokens and
  regular-expression-based highlight rules in the same mode.

Using DFA Highlighting
----------------------

If Jed is compiled with DFA highlighting enabled, it will define the
S-Lang preprocessor name `HAS_DFA_SYNTAX', and also define three
extra functions: `dfa_enable_highlight_cache', `dfa_define_highlight_rule'
and `dfa_build_highlight_table'. These are documented in Jed's ordinary
function help.

To implement a DFA highlighting scheme, you define a number of
highlighting rules using `dfa_define_highlight_rule', and then enable
the scheme using `dfa_build_highlight_table', which will build the
internal data structure (DFA table) that is actually used to do the
highlighting.
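
As a rough illustration of that sequence, a mode file might contain
something like the sketch below. The table name `MyMode' and the three
rules are hypothetical, chosen only for the example. The `#ifdef' guard
relies on the `HAS_DFA_SYNTAX' preprocessor name mentioned above, and
`create_syntax_table' is Jed's ordinary function for creating a syntax
table.

	% Hypothetical mode "MyMode"; names and rules are illustrative only.
	create_syntax_table ("MyMode");

	% In a real file, preprocessor lines must begin in column one.
	#ifdef HAS_DFA_SYNTAX
	% Define the rules first ...
	dfa_define_highlight_rule ("[0-9]+", "number", "MyMode");
	dfa_define_highlight_rule ("[A-Za-z_][A-Za-z_0-9]*", "normal", "MyMode");
	dfa_define_highlight_rule ("[ \t]+", "normal", "MyMode");
	% ... then build the DFA table from them.
	dfa_build_highlight_table ("MyMode");
	#endif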

Generating the DFA table can take a long time, especially for complex
modes such as C (or even more so, PostScript). For this reason, the
DFA tables can be cached by the use of `dfa_enable_highlight_cache'.
You call this routine before defining any highlighting rules. If the
cache file exists, the DFA table will be loaded directly from it, and
the subsequent calls to `dfa_define_highlight_rule' and
`dfa_build_highlight_table' will do nothing. If the cache file does
not exist, then after Jed has built the DFA table it will attempt to
create the cache.
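
The ordering described above might look like the following sketch. The
cache file name `mymode.dfa' and the table name `MyMode' are
hypothetical. The argument order shown for `dfa_enable_highlight_cache'
(cache file name, then syntax table name) is an assumption here, so
check Jed's function help for the exact signature.

	% Must come before any rules are defined for this table.
	dfa_enable_highlight_cache ("mymode.dfa", "MyMode");
	% If the cache file was found and loaded, the calls below do nothing.
	dfa_define_highlight_rule ("[0-9]+", "number", "MyMode");
	dfa_build_highlight_table ("MyMode");
	% If there was no cache file, Jed now attempts to create one.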

Cache files are searched along the set of paths specified by the
`Jed_Highlight_Cache_Path' variable.  The default value for
`Jed_Highlight_Cache_Path' is $JED_ROOT/lib, which assumes that cache
files were created when the editor was installed via the optional
installation step:

         jed -batch -l preparse

On systems such as Unix, the average user has no permission to create
cache files in $JED_ROOT/lib.  Hence, if the necessary cache files
were not created during the installation step, it may be advantageous
for the user to set the `Jed_Highlight_Cache_Dir' variable to a
directory where cache files may be created.
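
A user's startup file could set this up along the following lines. This
is only a sketch: the directory name is hypothetical, and the example
assumes that `Jed_Highlight_Cache_Path' is a comma-separated list of
directories, which should be checked against Jed's variable help.

	% Hypothetical per-user directory; it is assumed to exist already
	% and to be writable by the user.
	Jed_Highlight_Cache_Dir = "/home/user/.jed/dfa-cache";
	% Also search that directory when looking for existing cache files.
	Jed_Highlight_Cache_Path = Jed_Highlight_Cache_Dir + ","
	  + Jed_Highlight_Cache_Path;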

Highlighting Rules
------------------

Highlighting rules are basically regular expressions. You define
regular-expression patterns for the objects that you want to
highlight, and specify the colour in which each object should be
highlighted. Colours are specified as `keyword', `normal',
`operator', `delimiter' and so on.

A sample highlighting rule, from C mode, might look like this:

dfa_define_highlight_rule("0[xX][0-9A-Fa-f]*[LlUu]*", "number", "C");

This specifies that in the syntax table called `C', any object
matching the regular expression `0[xX][0-9A-Fa-f]*[LlUu]*' should be
highlighted in the colour assigned to numbers. This regular
expression matches C hexadecimal integer constants: a zero, an X (of
either case), a sequence of hex digits, and optionally any number of
L or U characters (of either case) on the end (for `long' or
`unsigned').
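
To illustrate the `2+3' case mentioned earlier, a decimal number rule
can admit a sign only directly after an exponent marker, with a
separate rule for operators. The two rules below are a sketch modelled
on that description and are not necessarily the ones that Jed's C mode
ships with. Backslashes are doubled because the patterns are written
as S-Lang strings.

	% Digits, optional fraction, optional exponent with optional sign.
	dfa_define_highlight_rule ("[0-9]+(\\.[0-9]*)?([Ee][\\+\\-]?[0-9]*)?",
				   "number", "C");
	% A few single-character operators, including `+' and `-'.
	dfa_define_highlight_rule ("[\\+\\-\\*/=<>]", "operator", "C");

With these rules `2E+3' is one number, while in `2+3' the number rule
stops after `2', the `+' falls to the operator rule, and `3' is a
second number.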

Regular expression syntax is as follows:

- A normal character matches itself. Normal characters include
  everything except special characters, which are ^ $ | * + ? [ ] -
  . ( ) and the backslash \.

- A character class [abcde] matches any one of the characters inside
  it. Ranges can be specified with a dash, e.g. [a-e]. A character
  class starting with a caret matches any single character _not_
  inside it, e.g. [^a-e] matches anything except a, b, c, d or e.

- A period (.) matches any character.

- A character, or a character class, or a regular expression in
  parentheses, can be followed by *, + or ?. If followed by * then
  it will match any number of occurrences of the original
  expression, including none at all; followed by + it will match any
  number *not* including zero; followed by ? it will match zero or
  one.

- Two regular expressions separated by | will match either one.

- A caret at the beginning of an expression causes it to match only
  when at the beginning of a line. A dollar at the end causes it to
  match only when at the end.

- If you want to match one of the special characters, you can remove
  its special properties by placing a backslash before it. This
  includes the backslash itself.

So, for example:

	apple|banana		matches `apple' or `banana'
	(apple|banana)?		matches `apple', `banana' or nothing
	b[ae]d			matches `bad' or `bed'
	[a-e]			matches `a', `b', `c', `d' or `e'
	[a\-e]			matches `a', `-' or `e'
	^#include		matches `#include', but only at the start
				of a line
	'[^']*'			matches any sequence of non-single-quotes
				with a single-quote at each end, such as
				a Pascal string literal
	'[^']*$			matches any sequence of non-single-quotes
				with a single-quote at the beginning and
				occurring at the end of a line, such as
				a Pascal string literal that the user has
				not finished typing
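
The two string-literal descriptions above could be turned into rules as
in the sketch below. The table name `MyMode' is hypothetical, and using
the same `string' colour for both the finished and the unfinished
literal is just one possible choice.

	% A complete literal: quote, non-quotes, quote.
	dfa_define_highlight_rule ("'[^']*'", "string", "MyMode");
	% An unfinished literal: quote, then non-quotes up to end of line.
	dfa_define_highlight_rule ("'[^']*$", "string", "MyMode");

Neither pattern contains a backslash or a double quote, so both can be
written as S-Lang strings without any extra escaping.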

To define a highlight rule, you think up the regular expression,
express it as an S-Lang string literal, and include it in a call to
`dfa_define_highlight_rule'.

CAUTION: S-Lang strings obey the same syntax as C strings. This
means that if you need a double quote or a backslash as part of your
regular expression, you have to put *another* backslash before it
when you write it as an S-Lang string. So the fifth example above
might read

	dfa_define_highlight_rule ("[a\\-e]", ...);

with the backslash doubled. S-Lang 2 introduced a suffix notation for
literal strings, so it is now possible to avoid doubling backslashes by
using the do-not-expand `R' suffix. The above example can be written as

	dfa_define_highlight_rule ("[a\-e]"R, ...);

Extra Magical Bits
------------------

The second argument to `dfa_define_highlight_rule' is a colour name.
This colour name can be prefixed by a few special letters for extra
magical effects:

`Q' causes the match to be _quick_. Most of the time, the regular
expression matcher finds the _longest_ string starting at the
current position that matches something. A `Q' rule will match with
far higher priority, and will match the _shortest_ string possible.
For example, consider the expression `/\*.*\*/' which matches `/*',
then any sequence of characters, then `*/' - a one-line C comment.
The difficulty is that C comments do not nest, and a sequence like

/* comment */ not comment */

should only be highlighted as a comment up to the _first_ `*/'. The
normal longest-match heuristic will highlight the _whole_ thing as a
comment, which is wrong. You can get round this by defining the rule
as quick, like this:

	dfa_define_highlight_rule("/\\*.*\\*/", "Qcomment", "C");
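
Because rules see only one line at a time, a comment that is opened but
not closed on the same line needs a rule of its own. A possible
companion rule is sketched below. It is an illustration only and is not
claimed to be exactly what Jed's C mode uses.

	% `/*' followed by the rest of the line, in the comment colour.
	dfa_define_highlight_rule("/\\*.*", "comment", "C");

On a line that does contain `*/', the quick rule above still wins, since
`Q' rules match with higher priority.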

`P' denotes a _preprocessor-type_ rule. Preprocessor-type rules
state that not only should the matched text be given the specified
colour, but so should everything on the rest of the line, _except_
things in the comment colour. This allows comments on preprocessor
lines, with quite a high level of sophistication: defining, in C mode,

	dfa_define_highlight_rule("^[ \t]*#", "PQpreprocess", "C");

will cause the following effects:

	#define FLAG			comes up in preprocessor colour
	#define FLAG /* comment */	the comment gets the comment colour
	#include "/*sdfs*/"		the comment does _not_ get seen!

Finally, `K' defines a _keyword_ rule. In a keyword rule, the
matched text is compared to the active keyword tables for the syntax
scheme, and given the correct keyword colour if a match is found.
If no keyword matches the text, the text will be highlighted in the
colour that was _actually_ specified in the rule.
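
A keyword rule might be combined with a keyword table as in the sketch
below. The table name `MyMode' and the keywords are hypothetical.
`define_keywords_n' is Jed's ordinary function for filling keyword
tables. It is assumed here to take the table name, the keywords of one
length concatenated in sorted order, that length, and the keyword table
number.

	create_syntax_table ("MyMode");
	% Two-letter keywords, sorted and concatenated: "do", "if".
	() = define_keywords_n ("MyMode", "doif", 2, 0);
	% Identifiers are looked up in the keyword tables; anything that
	% is not a keyword is shown in the `normal' colour.
	dfa_define_highlight_rule ("[A-Za-z_][A-Za-z_0-9]*", "Knormal", "MyMode");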

Further Reading
---------------

If you want to design _really_ complicated highlighting schemes, a
full understanding of the principles and theory behind the DFA scheme
may be helpful. Most books on compiler theory give a good discussion
of this.