<html lang="en">
<head>
<title>Regular Expressions - ne's manual</title>
<meta http-equiv="Content-Type" content="text/html">
<meta name="description" content="ne's manual">
<meta name="generator" content="makeinfo 4.13">
<link title="Top" rel="start" href="index.html#Top">
<link rel="up" href="Reference.html#Reference" title="Reference">
<link rel="prev" href="Menus.html#Menus" title="Menus">
<link rel="next" href="Automatic-Preferences.html#Automatic-Preferences" title="Automatic Preferences">
<link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
<meta http-equiv="Content-Style-Type" content="text/css">
<style type="text/css"><!--
  pre.display { font-family:inherit }
  pre.format  { font-family:inherit }
  pre.smalldisplay { font-family:inherit; font-size:smaller }
  pre.smallformat  { font-family:inherit; font-size:smaller }
  pre.smallexample { font-size:smaller }
  pre.smalllisp    { font-size:smaller }
  span.sc    { font-variant:small-caps }
  span.roman { font-family:serif; font-weight:normal; } 
  span.sansserif { font-family:sans-serif; font-weight:normal; } 
--></style>
</head>
<body>
<div class="node">
<a name="Regular-Expressions"></a>
<p>
Next:&nbsp;<a rel="next" accesskey="n" href="Automatic-Preferences.html#Automatic-Preferences">Automatic Preferences</a>,
Previous:&nbsp;<a rel="previous" accesskey="p" href="Menus.html#Menus">Menus</a>,
Up:&nbsp;<a rel="up" accesskey="u" href="Reference.html#Reference">Reference</a>
<hr>
</div>

<h3 class="section">3.8 Regular Expressions</h3>

<p><a name="index-Regular-Expressions-73"></a>
Regular expressions are a powerful way of specifying complex search and
replace operations. <code>ne</code> supports the full regular expression
syntax on US-ASCII and 8-bit buffers, but has to impose a restriction on
character sets when searching in UTF-8 text. See <a href="UTF_002d8-Support.html#UTF_002d8-Support">UTF-8 Support</a>.

<h4 class="subsection">3.8.1 Syntax</h4>

<p>The following section is taken (with minor modifications) from the GNU regular
expression library documentation and is Copyright &copy; Free Software
Foundation.

   <p>A regular expression describes a set of strings.  The simplest case is one
that describes a particular string; for example, the string &lsquo;<samp><span class="samp">foo</span></samp>&rsquo; when
regarded as a regular expression matches &lsquo;<samp><span class="samp">foo</span></samp>&rsquo; and nothing else. 
Nontrivial regular expressions use certain special constructs so that they
can match more than one string.  For example, the regular expression
&lsquo;<samp><span class="samp">foo|bar</span></samp>&rsquo; matches either the string &lsquo;<samp><span class="samp">foo</span></samp>&rsquo; or the string
&lsquo;<samp><span class="samp">bar</span></samp>&rsquo;; the regular expression &lsquo;<samp><span class="samp">c[ad]*r</span></samp>&rsquo; matches any of the strings
&lsquo;<samp><span class="samp">cr</span></samp>&rsquo;, &lsquo;<samp><span class="samp">car</span></samp>&rsquo;, &lsquo;<samp><span class="samp">cdr</span></samp>&rsquo;, &lsquo;<samp><span class="samp">caar</span></samp>&rsquo;, &lsquo;<samp><span class="samp">cadddar</span></samp>&rsquo; and all other
such strings with any number of &lsquo;<samp><span class="samp">a</span></samp>&rsquo;'s and &lsquo;<samp><span class="samp">d</span></samp>&rsquo;'s.
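
   <p>As a rough illustration, the two example expressions above can be tried
in Python. The <code>re</code> module is used here only as a stand-in for the
matcher in <code>ne</code> (an assumption: the two agree on these simple
constructs), and the helper <code>matches</code> is a hypothetical name
introduced for this sketch.

<pre class="example">
```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# 'foo|bar' matches exactly the strings 'foo' and 'bar'.
alternation = [s for s in ["foo", "bar", "foobar", "fo"] if matches(r"foo|bar", s)]

# 'c[ad]*r' matches 'c', then any number of 'a' or 'd', then 'r'.
candidates = ["cr", "car", "cdr", "caar", "cadddar", "cbr", "ca"]
repeated = [s for s in candidates if matches(r"c[ad]*r", s)]

print(alternation)
print(repeated)
```
</pre>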

   <p>Regular expressions have a syntax in which a few characters are special
constructs and the rest are <dfn>ordinary</dfn>.  An ordinary character is a
simple regular expression which matches that character and nothing else. The
special characters are &lsquo;<samp><span class="samp">$</span></samp>&rsquo;, &lsquo;<samp><span class="samp">^</span></samp>&rsquo;, &lsquo;<samp><span class="samp">.</span></samp>&rsquo;, &lsquo;<samp><span class="samp">*</span></samp>&rsquo;, &lsquo;<samp><span class="samp">+</span></samp>&rsquo;,
&lsquo;<samp><span class="samp">?</span></samp>&rsquo;, &lsquo;<samp><span class="samp">[</span></samp>&rsquo;, &lsquo;<samp><span class="samp">]</span></samp>&rsquo;, &lsquo;<samp><span class="samp">(</span></samp>&rsquo;, &lsquo;<samp><span class="samp">)</span></samp>&rsquo;, &lsquo;<samp><span class="samp">|</span></samp>&rsquo; and &lsquo;<samp><span class="samp">\</span></samp>&rsquo;.  Any other
character appearing in a regular expression is ordinary, unless a &lsquo;<samp><span class="samp">\</span></samp>&rsquo;
precedes it.

   <p>For example, &lsquo;<samp><span class="samp">f</span></samp>&rsquo; is not a special character, so it is ordinary,
and therefore &lsquo;<samp><span class="samp">f</span></samp>&rsquo; is a regular expression that matches the string &lsquo;<samp><span class="samp">f</span></samp>&rsquo;
and no other string.  (It does <em>not</em> match the string &lsquo;<samp><span class="samp">ff</span></samp>&rsquo;.)  Likewise,
&lsquo;<samp><span class="samp">o</span></samp>&rsquo; is a regular expression that matches only &lsquo;<samp><span class="samp">o</span></samp>&rsquo;.

   <p>Any two regular expressions <var>a</var> and <var>b</var> can be concatenated. 
The result is a regular expression that matches a string if <var>a</var>
matches some amount of the beginning of that string and <var>b</var>
matches the rest of the string.

   <p>As a simple example, we can concatenate the regular expressions
&lsquo;<samp><span class="samp">f</span></samp>&rsquo; and &lsquo;<samp><span class="samp">o</span></samp>&rsquo; to get the regular expression &lsquo;<samp><span class="samp">fo</span></samp>&rsquo;,
which matches only the string &lsquo;<samp><span class="samp">fo</span></samp>&rsquo;.  Still trivial.

   <p>Note: special characters are treated as ordinary ones if they are in
contexts where their special meanings make no sense.  For example,
&lsquo;<samp><span class="samp">*foo</span></samp>&rsquo; treats &lsquo;<samp><span class="samp">*</span></samp>&rsquo; as ordinary since there is no preceding
expression on which the &lsquo;<samp><span class="samp">*</span></samp>&rsquo; can act. It is poor practice to depend on
this behaviour; better to quote the special character anyway, regardless of
where it appears.

   <p>The following are the characters and character sequences that have special
meaning within regular expressions. Any character not mentioned here is not
special; it stands for exactly itself for the purposes of searching and
matching.

     <dl>
<dt>&lsquo;<samp><span class="samp">.</span></samp>&rsquo;<dd>is a special character that matches any single character except a newline. Using
concatenation, we can make regular expressions like &lsquo;<samp><span class="samp">a.b</span></samp>&rsquo;, which matches
any three-character string which begins with &lsquo;<samp><span class="samp">a</span></samp>&rsquo; and ends with
&lsquo;<samp><span class="samp">b</span></samp>&rsquo;.

     <br><dt>&lsquo;<samp><span class="samp">*</span></samp>&rsquo;<dd>is not a construct by itself; it is a suffix, which means the preceding
regular expression is to be repeated as many times as possible.  In
&lsquo;<samp><span class="samp">fo*</span></samp>&rsquo;, the &lsquo;<samp><span class="samp">*</span></samp>&rsquo; applies to the &lsquo;<samp><span class="samp">o</span></samp>&rsquo;, so &lsquo;<samp><span class="samp">fo*</span></samp>&rsquo; matches
&lsquo;<samp><span class="samp">f</span></samp>&rsquo; followed by any number of &lsquo;<samp><span class="samp">o</span></samp>&rsquo;'s.

     <p>The case of zero &lsquo;<samp><span class="samp">o</span></samp>&rsquo;'s is allowed: &lsquo;<samp><span class="samp">fo*</span></samp>&rsquo; does match
&lsquo;<samp><span class="samp">f</span></samp>&rsquo;.

     <p>&lsquo;<samp><span class="samp">*</span></samp>&rsquo; always applies to the <em>smallest</em> possible preceding
expression. Thus, &lsquo;<samp><span class="samp">fo*</span></samp>&rsquo; has a repeating &lsquo;<samp><span class="samp">o</span></samp>&rsquo;, not a repeating
&lsquo;<samp><span class="samp">fo</span></samp>&rsquo;.

     <br><dt>&lsquo;<samp><span class="samp">+</span></samp>&rsquo;<dd>&lsquo;<samp><span class="samp">+</span></samp>&rsquo; is like &lsquo;<samp><span class="samp">*</span></samp>&rsquo; except that at least one match for the preceding
pattern is required for &lsquo;<samp><span class="samp">+</span></samp>&rsquo;.  Thus, &lsquo;<samp><span class="samp">c[ad]+r</span></samp>&rsquo; does not match
&lsquo;<samp><span class="samp">cr</span></samp>&rsquo; but does match anything else that &lsquo;<samp><span class="samp">c[ad]*r</span></samp>&rsquo; would match.

     <br><dt>&lsquo;<samp><span class="samp">?</span></samp>&rsquo;<dd>&lsquo;<samp><span class="samp">?</span></samp>&rsquo; is like &lsquo;<samp><span class="samp">*</span></samp>&rsquo; except that it allows either zero or one match for
the preceding pattern.  Thus, &lsquo;<samp><span class="samp">c[ad]?r</span></samp>&rsquo; matches &lsquo;<samp><span class="samp">cr</span></samp>&rsquo; or &lsquo;<samp><span class="samp">car</span></samp>&rsquo;
or &lsquo;<samp><span class="samp">cdr</span></samp>&rsquo;, and nothing else.
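
     <p>The three suffixes described so far can be compared side by side. The
following is a minimal sketch in Python, whose <code>re</code> module is
assumed to treat &lsquo;<samp><span class="samp">*</span></samp>&rsquo;, &lsquo;<samp><span class="samp">+</span></samp>&rsquo; and &lsquo;<samp><span class="samp">?</span></samp>&rsquo; in the same way;
<code>matches</code> is a hypothetical helper, not part of <code>ne</code>.

<pre class="example">
```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

words = ["cr", "car", "cdr", "caar", "cadr"]

star = [w for w in words if matches(r"c[ad]*r", w)]      # zero or more
plus = [w for w in words if matches(r"c[ad]+r", w)]      # one or more
optional = [w for w in words if matches(r"c[ad]?r", w)]  # zero or one

# '*' applies to the smallest preceding expression: in 'fo*' only 'o' repeats.
fo_star = [s for s in ["f", "fo", "fooo", "fofo"] if matches(r"fo*", s)]

print(star, plus, optional, fo_star)
```
</pre>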

     <br><dt>&lsquo;<samp><span class="samp">[ ... ]</span></samp>&rsquo;<dd>&lsquo;<samp><span class="samp">[</span></samp>&rsquo; begins a <dfn>character set</dfn>, which is terminated by a &lsquo;<samp><span class="samp">]</span></samp>&rsquo;. 
In the simplest case, the characters between the two form the set. 
Thus, &lsquo;<samp><span class="samp">[ad]</span></samp>&rsquo; matches either &lsquo;<samp><span class="samp">a</span></samp>&rsquo; or &lsquo;<samp><span class="samp">d</span></samp>&rsquo;,
and &lsquo;<samp><span class="samp">[ad]*</span></samp>&rsquo; matches any string of &lsquo;<samp><span class="samp">a</span></samp>&rsquo;'s and &lsquo;<samp><span class="samp">d</span></samp>&rsquo;'s
(including the empty string), from which it follows that
&lsquo;<samp><span class="samp">c[ad]*r</span></samp>&rsquo; matches &lsquo;<samp><span class="samp">car</span></samp>&rsquo;, <i>et cetera</i>.

     <p>Character ranges can also be included in a character set, by writing two
characters with a &lsquo;<samp><span class="samp">-</span></samp>&rsquo; between them.  Thus, &lsquo;<samp><span class="samp">[a-z]</span></samp>&rsquo; matches any
lower-case letter.  Ranges may be intermixed freely with individual
characters, as in &lsquo;<samp><span class="samp">[a-z$%.]</span></samp>&rsquo;, which matches any lower case letter or
&lsquo;<samp><span class="samp">$</span></samp>&rsquo;, &lsquo;<samp><span class="samp">%</span></samp>&rsquo; or period.

     <p>Note that the usual special characters are not special any more inside a
character set.  A completely different set of special characters exists
inside character sets: &lsquo;<samp><span class="samp">]</span></samp>&rsquo;, &lsquo;<samp><span class="samp">-</span></samp>&rsquo; and &lsquo;<samp><span class="samp">^</span></samp>&rsquo;.

     <p>To include a &lsquo;<samp><span class="samp">]</span></samp>&rsquo; in a character set, you must make it
the first character.  For example, &lsquo;<samp><span class="samp">[]a]</span></samp>&rsquo; matches &lsquo;<samp><span class="samp">]</span></samp>&rsquo; or &lsquo;<samp><span class="samp">a</span></samp>&rsquo;. 
To include a &lsquo;<samp><span class="samp">-</span></samp>&rsquo;, you must use it in a context where it cannot possibly
indicate a range: that is, as the first character, or immediately
after a range.

     <p>Note that when searching in UTF-8 text, a character set may contain
US-ASCII characters only.

     <br><dt>&lsquo;<samp><span class="samp">[^ ... ]</span></samp>&rsquo;<dd>&lsquo;<samp><span class="samp">[^</span></samp>&rsquo; begins a <dfn>complement character set</dfn>, which matches any
character except the ones specified.  Thus, &lsquo;<samp><span class="samp">[^a-z0-9A-Z]</span></samp>&rsquo; matches
all characters <em>except</em> letters and digits. Also in this case, when
searching in UTF-8 text a complemented character set may contain US-ASCII
characters only.

     <p>&lsquo;<samp><span class="samp">^</span></samp>&rsquo; is not special in a character set unless it is the first character. 
The character following the &lsquo;<samp><span class="samp">^</span></samp>&rsquo; is treated as if it were first (it may
be a &lsquo;<samp><span class="samp">-</span></samp>&rsquo; or a &lsquo;<samp><span class="samp">]</span></samp>&rsquo;).
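
     <p>The placement rules for &lsquo;<samp><span class="samp">]</span></samp>&rsquo;, &lsquo;<samp><span class="samp">-</span></samp>&rsquo; and &lsquo;<samp><span class="samp">^</span></samp>&rsquo; can be checked
one character at a time. This sketch uses Python's <code>re</code> module as
a stand-in (an assumption: it follows the same placement rules for the sets
shown); <code>chars_matching</code> and the sample string are made up for
the illustration.

<pre class="example">
```python
import re

def chars_matching(char_set, sample):
    """Return the characters of sample that the one-character set matches."""
    return "".join(ch for ch in sample if re.fullmatch(char_set, ch))

sample = "a]-^q$%.Z5 "

plain = chars_matching(r"[a-z$%.]", sample)            # range mixed with single characters
bracket = chars_matching(r"[]a]", sample)              # ']' placed first is literal
dash = chars_matching(r"[-a]", sample)                 # '-' placed first is literal
caret = chars_matching(r"[a^]", sample)                # '^' not placed first is literal
complement = chars_matching(r"[^a-z0-9A-Z]", sample)   # all but letters and digits

print(plain, bracket, dash, caret, repr(complement))
```
</pre>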

     <br><dt>&lsquo;<samp><span class="samp">^</span></samp>&rsquo;<dd>is a special character that matches the empty string &ndash; but only if at the
beginning of a line in the text being matched.  Otherwise it fails to match
anything.  Thus, &lsquo;<samp><span class="samp">^foo</span></samp>&rsquo; matches a &lsquo;<samp><span class="samp">foo</span></samp>&rsquo; that occurs at the
beginning of a line.

     <br><dt>&lsquo;<samp><span class="samp">$</span></samp>&rsquo;<dd>is similar to &lsquo;<samp><span class="samp">^</span></samp>&rsquo; but matches only at the end of a line. Thus,
&lsquo;<samp><span class="samp">xx*$</span></samp>&rsquo; matches a string of one or more &lsquo;<samp><span class="samp">x</span></samp>&rsquo;'s at the end of a
line.
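
     <p>A small sketch of the two anchors, again in Python as a stand-in. In
Python's <code>re</code> module the <code>re.MULTILINE</code> flag is needed
to make &lsquo;<samp><span class="samp">^</span></samp>&rsquo; and &lsquo;<samp><span class="samp">$</span></samp>&rsquo; match at line boundaries; that the result
then mirrors the behaviour described here is an assumption, and the sample
text is invented.

<pre class="example">
```python
import re

text = "foo bar\nbar foo\nxxx\nfoo xx"

# Offsets of every 'foo' that starts a line.
starts = [m.start() for m in re.finditer(r"^foo", text, re.MULTILINE)]

# Every run of one or more 'x' that ends a line.
ends = [m.group() for m in re.finditer(r"xx*$", text, re.MULTILINE)]

print(starts, ends)
```
</pre>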

     <br><dt>&lsquo;<samp><span class="samp">\</span></samp>&rsquo;<dd>has two functions: it quotes the above special characters (including
&lsquo;<samp><span class="samp">\</span></samp>&rsquo;), and it introduces additional special constructs.

     <p>Because &lsquo;<samp><span class="samp">\</span></samp>&rsquo; quotes special characters, &lsquo;<samp><span class="samp">\$</span></samp>&rsquo; is a regular
expression that matches only &lsquo;<samp><span class="samp">$</span></samp>&rsquo;, and &lsquo;<samp><span class="samp">\[</span></samp>&rsquo; is a regular
expression that matches only &lsquo;<samp><span class="samp">[</span></samp>&rsquo;, and so on.

     <p>For the most part, &lsquo;<samp><span class="samp">\</span></samp>&rsquo; followed by any character matches only that
character.  However, there are several exceptions: characters which, when
preceded by &lsquo;<samp><span class="samp">\</span></samp>&rsquo;, are special constructs.  Such characters are always
ordinary when encountered on their own.

     <br><dt>&lsquo;<samp><span class="samp">|</span></samp>&rsquo;<dd>specifies an alternative. Two regular expressions <var>a</var> and <var>b</var> with
&lsquo;<samp><span class="samp">|</span></samp>&rsquo; in between form an expression that matches anything that either
<var>a</var> or <var>b</var> will match.

     <p>Thus, &lsquo;<samp><span class="samp">foo|bar</span></samp>&rsquo; matches either &lsquo;<samp><span class="samp">foo</span></samp>&rsquo; or &lsquo;<samp><span class="samp">bar</span></samp>&rsquo; but no other
string.

     <p>&lsquo;<samp><span class="samp">|</span></samp>&rsquo; applies to the largest possible surrounding expressions.  Only a
surrounding &lsquo;<samp><span class="samp">( ... )</span></samp>&rsquo; grouping can limit the grouping power of
&lsquo;<samp><span class="samp">|</span></samp>&rsquo;.

     <br><dt>&lsquo;<samp><span class="samp">( ... )</span></samp>&rsquo;<dd>is a grouping construct that serves three purposes:

          <ol type=1 start=1>
<li>To enclose a set of &lsquo;<samp><span class="samp">|</span></samp>&rsquo; alternatives for other operations. 
Thus, &lsquo;<samp><span class="samp">(foo|bar)x</span></samp>&rsquo; matches either &lsquo;<samp><span class="samp">foox</span></samp>&rsquo; or &lsquo;<samp><span class="samp">barx</span></samp>&rsquo;.

          <li>To enclose a complicated expression for the postfix &lsquo;<samp><span class="samp">*</span></samp>&rsquo; to operate on. 
Thus, &lsquo;<samp><span class="samp">ba(na)*</span></samp>&rsquo; matches &lsquo;<samp><span class="samp">bananana</span></samp>&rsquo; <i>et cetera</i>, with any (zero or
more) number of &lsquo;<samp><span class="samp">na</span></samp>&rsquo;'s.

          <li>To mark a matched substring for future reference.

          </ol>

     <p>This last application is not a consequence of the idea of a parenthetical
grouping; it is a separate feature that happens to be assigned as a second
meaning to the same &lsquo;<samp><span class="samp">( ... )</span></samp>&rsquo; construct because there is no
conflict in practice between the two meanings.  Here is an explanation of
this feature:

     <br><dt>&lsquo;<samp><span class="samp">\</span><var>digit</var></samp>&rsquo;<dd>After the end of a &lsquo;<samp><span class="samp">( ... )</span></samp>&rsquo; construct, the matcher remembers the
beginning and end of the text matched by that construct.  Then, later on in
the regular expression, you can use &lsquo;<samp><span class="samp">\</span></samp>&rsquo; followed by <var>digit</var> to mean
&ldquo;match the same text matched the <var>digit</var>'th time by the &lsquo;<samp><span class="samp">(
... )</span></samp>&rsquo; construct.&rdquo;  The &lsquo;<samp><span class="samp">( ... )</span></samp>&rsquo; constructs are numbered
in order of commencement in the regexp.

     <p>The strings matching the first nine &lsquo;<samp><span class="samp">( ... )</span></samp>&rsquo; constructs appearing
in a regular expression are assigned numbers 1 through 9 in order of their
beginnings. 
&lsquo;<samp><span class="samp">\1</span></samp>&rsquo; through &lsquo;<samp><span class="samp">\9</span></samp>&rsquo; may be used to refer to the text matched by
the corresponding &lsquo;<samp><span class="samp">( ... )</span></samp>&rsquo; construct.

     <p>For example, &lsquo;<samp><span class="samp">(.+)\1</span></samp>&rsquo; matches any nonempty string that is composed of
two identical halves.  The &lsquo;<samp><span class="samp">(.+)</span></samp>&rsquo; matches the first half, which may be
anything nonempty, but the &lsquo;<samp><span class="samp">\1</span></samp>&rsquo; that follows must match exactly the same
text.
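
     <p>The three uses of &lsquo;<samp><span class="samp">( ... )</span></samp>&rsquo; and the back-reference &lsquo;<samp><span class="samp">\1</span></samp>&rsquo; can be
sketched as follows. Python's <code>re</code> module stands in for the
matcher (an assumption: grouping and back-references behave alike for these
patterns), and <code>matches</code> is a hypothetical helper.

<pre class="example">
```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Grouping a set of alternatives.
grouped_alt = [s for s in ["foox", "barx", "foobarx"] if matches(r"(foo|bar)x", s)]

# Grouping an expression for '*' to repeat.
grouped_rep = [s for s in ["ba", "bana", "bananana", "banan"] if matches(r"ba(na)*", s)]

# Referring back to the text matched by the first group.
doubled = [s for s in ["abab", "aa", "abc", "abcabc", "a"] if matches(r"(.+)\1", s)]

print(grouped_alt, grouped_rep, doubled)
```
</pre>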

     <br><dt>&lsquo;<samp><span class="samp">\b</span></samp>&rsquo;<dd>matches the empty string, but only if it is at the beginning or
end of a word.  Thus, &lsquo;<samp><span class="samp">\bfoo\b</span></samp>&rsquo; matches any occurrence of
&lsquo;<samp><span class="samp">foo</span></samp>&rsquo; as a separate word.  &lsquo;<samp><span class="samp">\bball(s|)\b</span></samp>&rsquo; matches
&lsquo;<samp><span class="samp">ball</span></samp>&rsquo; or &lsquo;<samp><span class="samp">balls</span></samp>&rsquo; as a separate word.

     <br><dt>&lsquo;<samp><span class="samp">\B</span></samp>&rsquo;<dd>matches the empty string, provided it is <em>not</em> at the beginning or end
of a word.

     <br><dt>&lsquo;<samp><span class="samp">\&lt;</span></samp>&rsquo;<dd>matches the empty string, but only if it is at the beginning
of a word.

     <br><dt>&lsquo;<samp><span class="samp">\&gt;</span></samp>&rsquo;<dd>matches the empty string, but only if it is at the end of a word.

     <br><dt>&lsquo;<samp><span class="samp">\w</span></samp>&rsquo;<dd>matches any word-constituent character. These are US-ASCII letters,
digits and the underscore, regardless of the buffer encoding.

     <br><dt>&lsquo;<samp><span class="samp">\W</span></samp>&rsquo;<dd>matches any character that is not a word-constituent. 
</dl>
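
<p>The word-related constructs can be tried out with the sketch below. It
uses Python's <code>re</code> module as a stand-in, with the
<code>re.ASCII</code> flag so that &lsquo;<samp><span class="samp">\w</span></samp>&rsquo; is limited to US-ASCII letters, digits
and the underscore as described above. The &lsquo;<samp><span class="samp">\&lt;</span></samp>&rsquo; and &lsquo;<samp><span class="samp">\&gt;</span></samp>&rsquo; operators have
no direct counterpart in that module and are not shown; the sample strings
are invented.

<pre class="example">
```python
import re

text = "foo foobar balls ball_park ball"

# 'foo' as a separate word: only the first occurrence qualifies.
whole_foo = [m.start() for m in re.finditer(r"\bfoo\b", text)]

# 'ball' or 'balls' as a separate word; 'ball_park' is one word, so it is skipped.
balls = [m.group() for m in re.finditer(r"\bball(s|)\b", text)]

# 'oo' with word characters on both sides: inside 'boot' but not at the end of 'foo'.
inner = re.findall(r"\Boo\B", "foo boot")

# Word-constituent and other characters, restricted to US-ASCII.
sample = "a_1 \u00e9-!"
word_chars = "".join(re.findall(r"\w", sample, re.ASCII))
non_word = "".join(re.findall(r"\W", sample, re.ASCII))

print(whole_foo, balls, inner, word_chars, repr(non_word))
```
</pre>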

<h4 class="subsection">3.8.2 Replacing regular expressions</h4>

<p>The replacement string also has some special features when doing a regular
expression search and replace. Exactly as during the search, &lsquo;<samp><span class="samp">\</span></samp>&rsquo; followed
by <var>digit</var> stands for &ldquo;the text matched the <var>digit</var>'th time by the
&lsquo;<samp><span class="samp">( ... )</span></samp>&rsquo; construct in the search expression&rdquo;. Moreover, &lsquo;<samp><span class="samp">\0</span></samp>&rsquo;
represents the whole string matched by the regular expression. Thus, for
instance, the replace string &lsquo;<samp><span class="samp">\0\0</span></samp>&rsquo; has the effect of doubling any string
matched.

   <p>Another example: if you search for &lsquo;<samp><span class="samp">(a+)(b+)</span></samp>&rsquo;, replacing with
&lsquo;<samp><span class="samp">\2x\1</span></samp>&rsquo;, you will match any string composed of a series of &lsquo;<samp><span class="samp">a</span></samp>&rsquo;'s
followed by a series of &lsquo;<samp><span class="samp">b</span></samp>&rsquo;'s, and you will replace it with the
string obtained by moving the &lsquo;<samp><span class="samp">b</span></samp>&rsquo;'s in front of the &lsquo;<samp><span class="samp">a</span></samp>&rsquo;'s and adding
an &lsquo;<samp><span class="samp">x</span></samp>&rsquo; in between. For instance, &lsquo;<samp><span class="samp">aaaab</span></samp>&rsquo; will be matched and
replaced by &lsquo;<samp><span class="samp">bxaaaa</span></samp>&rsquo;.

   <p>Note that the backslash character can escape itself. Thus, to put a
backslash in the replacement string, you have to use &lsquo;<samp><span class="samp">\\</span></samp>&rsquo;.
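
   <p>The replacement rules above can be emulated with a short Python sketch.
It does not show how <code>ne</code> implements them. The functions
<code>expand</code> and <code>replace_all</code> are hypothetical names, and
treating a backslash followed by any non-digit character as that character
is an assumption that goes beyond the self-escaping backslash documented
here.

<pre class="example">
```python
import re

def expand(template, match):
    """Expand backslash-digit references and escaped characters in template."""
    def substitute(escape):
        char = escape.group(1)
        if char.isdigit():
            # \0 is the whole match, \1 to \9 are the parenthesized groups.
            return match.group(int(char)) or ""
        # Assumption: any other escaped character stands for itself,
        # so a doubled backslash yields a single backslash.
        return char
    return re.sub(r"\\(.)", substitute, template)

def replace_all(pattern, template, text):
    """Replace every match of pattern in text with the expanded template."""
    return re.sub(pattern, lambda m: expand(template, m), text)

doubled = replace_all(r"fo+", r"\0\0", "foo bar")
swapped = replace_all(r"(a+)(b+)", r"\2x\1", "aaaab")
escaped = replace_all(r"/", r"\\", "a/b")

print(doubled, swapped, escaped)
```
</pre>

   <p>The expansion is done in a function rather than by passing the template
straight to <code>re.sub</code>, because Python's own replacement syntax does
not treat &lsquo;<samp><span class="samp">\0</span></samp>&rsquo; as a reference to the whole match.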

   </body></html>