/usr/share/doc/dillo/HtmlParser.txt is in dillo 3.0.5-4build1.

This file is owned by root:root, with mode 0o644.

 October 2001, --Jcid
 Last update: Jul 2009

                        ---------------
                        THE HTML PARSER
                        ---------------


   Dillo's parser is more than just an HTML parser: it also handles
XHTML and plain text. It has parsing 'modes' that define its
behaviour while working:

   typedef enum {
     DILLO_HTML_PARSE_MODE_INIT = 0,
     DILLO_HTML_PARSE_MODE_STASH,
     DILLO_HTML_PARSE_MODE_STASH_AND_BODY,
     DILLO_HTML_PARSE_MODE_BODY,
     DILLO_HTML_PARSE_MODE_VERBATIM,
     DILLO_HTML_PARSE_MODE_PRE
   } DilloHtmlParseMode;


   The parser works on a token-by-token basis, i.e., the data
stream is split into tokens and the parser is fed with them. The
process is simple: whenever the cache has new data, it is
passed to Html_write, which groups the data into tokens and calls
the appropriate function for each token type (tag, space, or word).

   Note: when in DILLO_HTML_PARSE_MODE_VERBATIM, the parser
no longer tries to split the data stream into tokens; it
simply collects data until the closing tag.
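
   As an illustration of the verbatim case, here is a minimal C++
sketch of locating the closing tag in a buffer. It is not Dillo's
code: the function name findClosingTag and its signature are
hypothetical, and the case-insensitive comparison is an assumption
based on HTML tag names being case-insensitive.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Dillo): returns the offset of
// "</name" at or after 'from', compared case-insensitively, or
// std::string::npos if the closing tag has not arrived yet.
std::size_t findClosingTag(const std::string &buf, std::size_t from,
                           const std::string &name)
{
   assert(!name.empty());
   const std::string needle = "</" + name;

   for (std::size_t i = from; i + needle.size() <= buf.size(); ++i) {
      std::size_t k = 0;
      while (k < needle.size() &&
             std::tolower((unsigned char)buf[i + k]) ==
             std::tolower((unsigned char)needle[k]))
         ++k;
      if (k == needle.size())
         return i;   // everything in [from, i) is verbatim content
   }
   return std::string::npos;   // keep collecting
}
```

   A caller would keep everything before the returned offset as the
verbatim content, and wait for more data when npos is returned.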

------
TOKENS
------

  * A chunk of WHITE SPACE     --> Html_process_space


  * TAG                        --> Html_process_tag

    The tag-start is defined by two adjacent characters:

      first : '<'
      second: ALPHA | '/' | '!' | '?'

      Note: comments are discarded ( <!-- ... --> )


    The tag's end is neither as easy to find nor to deal with:

    1)  The  HTML  4.01  sec.  3.2.2 states that "Attribute/value
    pairs appear before the final '>' of an element's start tag",
    but it doesn't define how to discriminate the "final" '>'.

    2) '<' and '>' should be escaped as '&lt;' and '&gt;' inside
    attribute values.

    3)  The XML SPEC for XHTML states:
      AttValue ::= '"' ([^<&"] | Reference)* '"'   |
                   "'" ([^<&'] | Reference)* "'"

    The current parser honors the XML SPEC.

    As it's a common mistake for human authors to mistype or
    forget one of the quote marks of an attribute value, the
    parser solves the problem with a look-ahead technique
    (otherwise the parser could skip significant amounts of
    properly-written HTML).



  * WORD                       --> Html_process_word

    A word is any run of characters outside of a tag that doesn't
    start with SPACE; it extends up to the first SPACE or tag start.

    SPACE = ' ' | \n | \r | \t | \f | \v
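
  The token rules above can be sketched in C++ as follows. This is
an illustration under stated assumptions, not Dillo's implementation:
the names Token, TokenKind, findTagEnd and tokenize are hypothetical,
comments are not discarded, and an unterminated tag simply takes the
rest of the buffer. The look-ahead shown is one possible heuristic:
a quoted value is skipped whole only if its closing quote comes
before any '<' (which the XML rule forbids inside a value);
otherwise the quote is treated as stray.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds mirroring the three handlers named above.
enum class TokenKind { Space, Tag, Word };
struct Token { TokenKind kind; std::string text; };

// SPACE = ' ' | \n | \r | \t | \f | \v
bool isSpace(char c) {
   return c == ' ' || c == '\n' || c == '\r' ||
          c == '\t' || c == '\f' || c == '\v';
}

bool isAlpha(char c) {
   return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// A tag starts with '<' followed by ALPHA, '/', '!' or '?'.
bool isTagStart(const std::string &buf, std::size_t i) {
   if (i + 1 >= buf.size() || buf[i] != '<')
      return false;
   char next = buf[i + 1];
   return isAlpha(next) || next == '/' || next == '!' || next == '?';
}

// Assumed look-ahead heuristic: returns the index of the tag's final
// '>' or npos. A '>' inside a well-formed quoted value is skipped.
std::size_t findTagEnd(const std::string &buf, std::size_t start) {
   assert(start < buf.size() && buf[start] == '<');
   std::size_t i = start + 1;
   while (i < buf.size()) {
      char c = buf[i];
      if (c == '>')
         return i;
      if (c == '"' || c == '\'') {
         std::size_t close = buf.find(c, i + 1);
         std::size_t lt = buf.find('<', i + 1);
         if (close != std::string::npos &&
             (lt == std::string::npos || close < lt)) {
            i = close + 1;   // balanced value: skip it whole
            continue;
         }
         // Unbalanced quote: fall through and treat it as a plain char.
      }
      ++i;
   }
   return std::string::npos;
}

std::vector<Token> tokenize(const std::string &buf) {
   std::vector<Token> out;
   std::size_t i = 0;
   while (i < buf.size()) {
      std::size_t start = i;
      if (isSpace(buf[i])) {
         while (i < buf.size() && isSpace(buf[i])) ++i;
         out.push_back({TokenKind::Space, buf.substr(start, i - start)});
      } else if (isTagStart(buf, i)) {
         std::size_t end = findTagEnd(buf, i);
         i = (end == std::string::npos) ? buf.size() : end + 1;
         out.push_back({TokenKind::Tag, buf.substr(start, i - start)});
      } else {
         ++i;   // a '<' that is not a tag start belongs to the word
         while (i < buf.size() && !isSpace(buf[i]) && !isTagStart(buf, i))
            ++i;
         out.push_back({TokenKind::Word, buf.substr(start, i - start)});
      }
   }
   return out;
}
```

  With this heuristic, a missing closing quote costs at most the
attribute it belongs to: the tag ends at the next '>' instead of
swallowing the markup that follows it.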


-----------------
THE PARSING STACK
-----------------

  The parsing state of the document is kept in a stack:

  class DilloHtml {
     [...]
     lout::misc::SimpleVector<DilloHtmlState> *stack;
     [...]
  };

  struct _DilloHtmlState {
     CssPropertyList *table_cell_props;
     DilloHtmlParseMode parse_mode;
     DilloHtmlTableMode table_mode;
     bool cell_text_align_set;
     DilloHtmlListMode list_type;
     int list_number;

     /* TagInfo index for the tag that's being processed */
     int tag_idx;

     dw::core::Widget *textblock, *table;

     /* This is used to align list items (especially in enumerated lists) */
     dw::core::Widget *ref_list_item;

     /* This is used for list items etc; if it is set to TRUE, breaks
        have to be "handed over" (see Html_add_indented and
        Html_eventually_pop_dw). */
     bool hand_over_break;
  };

  Basically, when a TAG is processed, a new state is pushed onto
the 'stack' and its 'style' is set to reflect the desired
appearance (details in DwStyle.txt).

  That way, when a word is processed later (added to the Dw), all
the information is within the top state.

  Closing TAGs just pop the stack.
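
  The push/pop discipline can be sketched in C++ as follows. This is
a reduced illustration, not Dillo's code: ParseStack, State and the
MODE_* names are hypothetical, std::vector stands in for
lout::misc::SimpleVector, and copying the top state on push (so that
nested tags inherit the current settings) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reduced, hypothetical subset of the parse modes.
enum ParseMode { MODE_INIT, MODE_BODY, MODE_PRE };

// Reduced, hypothetical version of the per-tag state.
struct State {
   ParseMode parse_mode;
   int tag_idx;   // index of the tag that opened this state
};

class ParseStack {
public:
   // The bottom state describes the document itself; it is never popped.
   ParseStack() { stack.push_back({MODE_INIT, -1}); }

   // Opening tag: push a copy of the top state (assumed inheritance),
   // then record which tag opened it.
   void openTag(int tag_idx) {
      State s = stack.back();
      s.tag_idx = tag_idx;
      stack.push_back(s);
   }

   // Closing tag: just pop, restoring the enclosing tag's state.
   void closeTag() {
      assert(stack.size() > 1);
      stack.pop_back();
   }

   // Words are processed against the top state only.
   State &top() { return stack.back(); }
   std::size_t depth() const { return stack.size(); }

private:
   std::vector<State> stack;
};
```

  Because each opening tag works on its own copy, whatever a tag
changes in the top state disappears when its closing tag pops it.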