/usr/share/doc/libhtmlparser-java-doc/html/faq.html is in libhtmlparser-java-doc 1.6.20060610.dfsg0-5.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>HTML Parser Frequently Asked Questions</title>
<meta name="author" content="
Derrick Oswald" />
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<link REL ="stylesheet" TYPE="text/css" HREF="javadoc/stylesheet.css" TITLE="Style">
</head>
<body class="composite">
<div id="bodyColumn">
<div id="contentBox">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></meta>
<meta name="KeyWords" content="faq,htmlparser,java"></meta>
<link rel="stylesheet" type="text/css" href="javadoc/stylesheet.css" title="Style"></link>
</head>
<div class="section"><h2>Frequently Asked Questions</h2>
<ul>
<li><a href="#encodingchangeexception">Why am I getting an EncodingChangeException?</a></li>
<li><a href="#post">How can I use POST to fetch a page?</a></li>
<li><a href="#timeout">Is there a way to force a timeout for delinquent pages?</a></li>
<li><a href="#composite">Why aren't &lt;P&gt;, &lt;B&gt;, &lt;I&gt; etc. tags fully nested?</a></li>
<li><a href="#quiet">How can I block parser messages from appearing on stdout?</a></li>
<li><a href="#empty">How does the parser deal with tags like &lt;tag/&gt;?</a></li>
<li><a href="#jsp">How is JSP parsed using the parser?</a></li>
<li><a href="#byte">How do you find the byte offset from the beginning of a document for a tag?</a></li>
</ul>
<a name="encodingchangeexception"></a>
<div class="section"><h3>Why am I getting an EncodingChangeException?</h3>
An EncodingChangeException is thrown to let you, the user, know that some nodes
already handed out by the parser are incorrect according to an encoding directive
in a &lt;META&gt; tag.
<p>When a &lt;META&gt; tag with an encoding directive is encountered, the parser
rescans the input up to the current position using the new encoding.
If a different character results from interpreting the bytes with the
new encoding, the exception is thrown.
</p>
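<p>To see why a rescan can yield different characters, the following sketch decodes the
same bytes under two encodings and reports the first position where the results differ.
It uses only the standard library; the class and method names are made up for
illustration, and this is not a description of how the parser performs its own check.
</p>

```java
import java.nio.charset.Charset;

/**
 * Illustration only: decode the same bytes with two encodings
 * and find where the resulting characters first differ.
 */
final class EncodingMismatch
{
    /**
     * Returns the index of the first differing character, or -1 if
     * both encodings produce identical text for these bytes.
     */
    static int firstDifference (byte[] bytes, String oldEncoding, String newEncoding)
    {
        String before = new String (bytes, Charset.forName (oldEncoding));
        String after = new String (bytes, Charset.forName (newEncoding));
        int length = Math.min (before.length (), after.length ());
        for (int i = 0; i < length; i++)
            if (before.charAt (i) != after.charAt (i))
                return (i);
        // identical so far, but one decoding may be longer than the other
        return ((before.length () == after.length ()) ? -1 : length);
    }

    public static void main (String[] args)
    {
        // "caf" followed by an e-acute encoded as UTF-8 (two bytes: 0xC3 0xA9)
        byte[] bytes = { 'c', 'a', 'f', (byte)0xC3, (byte)0xA9 };
        System.out.println (firstDifference (bytes, "ISO-8859-1", "UTF-8"));
    }
}
```

<p>Pure ASCII input decodes identically under both encodings, which is why many pages
never trigger the exception even when the server omits the encoding.
</p>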
<p>
If you are supplying the parser with your own input, as from a file,
be sure to set the encoding if it is not the default (ISO-8859-1).
You can do this on the Page, Lexer, or Parser objects.
</p>
<p>
If the parser is fetching the data for you, the problem is with the
HTTP server, which should have sent the correct encoding as part of the
Content-Type header string.
Given that you have no control over the server, the only solution is
to reattempt the parse with the new encoding.
</p>
<p>After the exception is thrown, the parser has already set its encoding to the
new value, so you should be able to simply reset and reparse. See, for
example, the handling in StringBean:
<div class="source"><pre>
try
{
    // ... parser.parse (...) throws an EncodingChangeException ...
}
catch (EncodingChangeException ece)
{
    // ... do whatever is necessary to reset your state here
    try
    {
        // reset the parser
        parser.reset ();
        // try again with the encoding now in force
        parser.parse (...);
    }
    catch (ParserException pe)
    {
        // the second attempt failed as well; handle or report it
    }
}
catch (ParserException pe)
{
    // some other parse failure; handle or report it
}
</pre></div>
</p>
</div>
<a name="post"></a>
<div class="section"><h3>How can I use POST to fetch a page?</h3>
<p>The standard HTTP request submitted by the parser is a GET.
The usual request submitted by a form is a POST.
</p>
<p>To illustrate how to use POST with the parser, we'll submit a form to the WHOIS
database of the American Registry for Internet Numbers (ARIN).<br />
<i>Note: there is an equivalent GET form at http://ws.arin.net/whois</i>.<br />
<i>See also:</i>
</p>
<ul>
<li>RIPE http://www.ripe.net/perl/whois</li>
<li>APNIC http://www.apnic.net/apnic-bin/whois.pl</li>
<li>LACNIC http://lacnic.net/cgi-bin/lacnic/whois</li>
</ul>
<p>On the ARIN web site, the page <a href="http://ws.arin.net/cgi-bin/whois.pl">http://ws.arin.net/cgi-bin/whois.pl</a>
has the following FORM that asks for an IP address and returns the registry details:
</p>
<div class="source"><pre>
&lt;form name="thisform" method="POST" action="/cgi-bin/whois.pl"&gt;
&lt;font face="arial,verdana,helvetica" size="2"&gt; Search for : &lt;/font&gt;
&lt;input type="text" Name="queryinput" size="20"&gt;
&lt;input type="submit"&gt;&lt;br&gt;
&lt;/form&gt;
</pre></div>
<p>From this we determine that the <tt>METHOD</tt> is <tt>POST</tt> and the form
should be submitted to <tt>/cgi-bin/whois.pl</tt>. This absolute path is resolved against the
host of the page it is found on, so the form should be submitted to <tt>http://ws.arin.net/cgi-bin/whois.pl</tt>
when the <tt>Submit</tt> input is clicked. The only <tt>INPUT</tt> element other than the
<tt>Submit</tt> is a single <tt>text</tt> field named <tt>queryinput</tt> that is displayed 20 characters wide.
Other types of input element are described in <a href="http://www.w3.org/TR/html4/interact/forms.html">http://www.w3.org/TR/html4/interact/forms.html</a>.
</p>
<p>The basic operation is to pass a fully prepared <tt>HttpURLConnection</tt> connected
to the <tt>POST</tt> target URL into the <tt>Parser</tt>, either in the constructor or
via the <tt>setConnection()</tt> method. To prepare the connection, call
<tt>setRequestMethod()</tt> to select the <tt>POST</tt> operation, then
<tt>setRequestProperty()</tt> and the other setter methods for whatever headers and options are required.
Then write the input field(s) as an ampersand concatenation
(<tt>"input1=value1&amp;input2=value2&amp;..."</tt>) into a <tt>PrintWriter</tt>
wrapped around the stream returned by <tt>getOutputStream()</tt>.
</p>
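<p>When field values can contain spaces, ampersands or other reserved characters, they
should be URL-encoded before being concatenated. A minimal sketch of building such a
request body with the standard library follows; the <tt>FormBody</tt> class and the
<tt>note</tt> field are hypothetical and only serve the illustration.
</p>

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/**
 * Illustration only: build the body of a form POST as
 * "name1=value1&name2=value2...", URL-encoding each part.
 */
final class FormBody
{
    /**
     * @param pairs Alternating field names and values.
     * @return The encoded request body.
     */
    static String encode (String... pairs)
    {
        if (0 != pairs.length % 2)
            throw new IllegalArgumentException ("names and values must come in pairs");
        StringBuilder buffer = new StringBuilder ();
        try
        {
            for (int i = 0; i < pairs.length; i += 2)
            {
                if (0 != i)
                    buffer.append ('&'); // separator between fields
                buffer.append (URLEncoder.encode (pairs[i], "UTF-8"));
                buffer.append ('=');
                buffer.append (URLEncoder.encode (pairs[i + 1], "UTF-8"));
            }
        }
        catch (UnsupportedEncodingException uee)
        {
            // UTF-8 is required to be available on every Java platform
            throw new IllegalStateException (uee);
        }
        return (buffer.toString ());
    }

    public static void main (String[] args)
    {
        // hypothetical values; 192.0.2.1 is an address reserved for documentation
        System.out.println (encode ("queryinput", "192.0.2.1", "note", "a b&c"));
    }
}
```

<p>The resulting string can be printed to the connection's output stream in place of a
hand-assembled buffer.
</p>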
<p>The following sample program illustrates the principles using a <tt>StringBean</tt>,
but the same code could be used with a <tt>Parser</tt> by replacing the last three lines
in the <tt>try</tt> block with:
</p>
<div class="source"><pre>
parser = new Parser ();
parser.setConnection (connection);
// ... do parser operations
</pre></div>
<p></p>
<div class="source"><pre>
import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
import org.htmlparser.beans.StringBean;
/**
* WhoIs.java
* Use POST to get information about an IP address from ws.arin.net.
* Created on April 29, 2006, 11:06 PM
*/
public class WhoIs
{
String mText; // text extracted from the response to the POST request
/**
* Creates a new instance of WhoIs.
*/
public WhoIs (String ipaddress)
{
URL url;
HttpURLConnection connection;
StringBuffer buffer;
PrintWriter out;
StringBean bean;
try
{
// from the 'action' (relative to the referring page)
url = new URL ("http://ws.arin.net/cgi-bin/whois.pl");
connection = (HttpURLConnection)url.openConnection ();
connection.setRequestMethod ("POST");
connection.setDoOutput (true);
connection.setDoInput (true);
connection.setUseCaches (false);
// more or less of these may be required
// see Request Header Definitions: http://www.ietf.org/rfc/rfc2616.txt
connection.setRequestProperty ("Accept-Charset", "*");
connection.setRequestProperty ("Referer", "http://ws.arin.net/cgi-bin/whois.pl");
connection.setRequestProperty ("User-Agent", "WhoIs.java/1.0");
buffer = new StringBuffer (1024);
// 'input' fields separated by ampersands (&)
buffer.append ("queryinput=");
buffer.append (ipaddress);
// etc.
out = new PrintWriter (connection.getOutputStream ());
out.print (buffer);
out.close ();
bean = new StringBean ();
bean.setConnection (connection);
mText = bean.getStrings ();
}
catch (Exception e)
{
mText = e.getMessage ();
}
}
public String getText ()
{
return (mText);
}
/**
* Program mainline.
* @param args The ip address (dot notation) to look up.
*/
public static void main (String[] args)
{
if (0 >= args.length)
System.out.println ("Usage: java WhoIs &lt;ipaddress&gt;");
else
System.out.println (new WhoIs (args[0]).getText ());
}
}
</pre></div>
</div>
<a name="timeout"></a>
<div class="section"><h3>Is there a way to force a timeout for delinquent pages?</h3>
<p>If you are using the Sun JVM, try using:
<div class="source"><pre>
System.setProperty ("sun.net.client.defaultReadTimeout", "7000");
System.setProperty ("sun.net.client.defaultConnectTimeout", "7000");
</pre></div>
in the mainline before starting your main application processing.
</p>
<p>This sets the socket connect and read timeouts to 7 seconds (7000 milliseconds), but you will need
to catch the I/O exceptions that are raised when a timeout expires.
</p>
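<p>As an alternative that does not rely on Sun-specific system properties, timeouts can
be set on an individual connection, assuming Java 5 or later, where
<tt>URLConnection</tt> has <tt>setConnectTimeout()</tt> and <tt>setReadTimeout()</tt>.
The sketch below only prepares the connection; the <tt>TimedConnection</tt> class name
and the example URL are placeholders. The prepared connection could then be handed to
the parser with <tt>setConnection()</tt>, as in the POST answer above.
</p>

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

/**
 * Illustration only: prepare a connection with explicit timeouts.
 */
final class TimedConnection
{
    /**
     * Creates a connection object for the URL with both timeouts set.
     * No network traffic happens until the connection is actually used.
     * @param url The page to fetch.
     * @param millis The connect and read timeout in milliseconds.
     */
    static URLConnection open (String url, int millis) throws IOException
    {
        URLConnection connection = new URL (url).openConnection ();
        connection.setConnectTimeout (millis); // limit on establishing the socket
        connection.setReadTimeout (millis);    // limit on each blocking read
        return (connection);
    }

    public static void main (String[] args) throws IOException
    {
        // placeholder URL; nothing is fetched here
        URLConnection connection = open ("http://example.com/", 7000);
        System.out.println (connection.getConnectTimeout ());
        System.out.println (connection.getReadTimeout ());
    }
}
```

<p>When either limit is exceeded the standard library raises a
<tt>java.net.SocketTimeoutException</tt>, which is a kind of <tt>IOException</tt>.
</p>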
</div>
<a name="composite"></a>
<div class="section"><h3>Why aren't &lt;P&gt;, &lt;B&gt;, &lt;I&gt; etc. tags fully nested?</h3>
<p>Authors are sometimes lazy and often fail to close some tags as
required by the HTML standard. This causes some problems for the parser.
</p>
<p>For this reason, not all possible tags are registered as composite tags,
which is what generates the 'parent/child' nesting relationship.
It is considered better to have a valid, less nested parse than a possibly invalid parse.
</p>
<p>You are free to add whatever nodes you like as composite nodes using the
prototypical node factory paradigm. First create your class that derives from
<tt>CompositeTag</tt> (copy and modify one of the existing tags that is most
like your desired tag):
</p>
<div class="source"><pre>
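import org.htmlparser.tags.CompositeTag;

public class BoldTag extends CompositeTag
{
    // the tag name(s) handled by this class
    private static final String[] mIds = new String[] {"B"};
    public BoldTag ()
    {
    }
    // names under which this tag is recognized
    public String[] getIds ()
    {
        return (mIds);
    }
    // tag names that force this tag to end (here, another B)
    public String[] getEnders ()
    {
        return (mIds);
    }
    // end tag names that force this tag to end (none)
    public String[] getEndTagEnders ()
    {
        return (new String[0]);
    }
}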
</pre></div>
<p>Then, register an instance of your node with a PrototypicalNodeFactory:
</p>
<div class="source"><pre>
PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
factory.registerTag (new BoldTag ());
parser.setNodeFactory (factory);
</pre></div>
<p>The problem becomes detecting when the tag doesn't have a &lt;/B&gt; like
it should, so <tt>getEnders()</tt> and <tt>getEndTagEnders()</tt> should probably return a
longer list of tag names. Enders are the names of start tags that, when encountered, force an end
tag to be generated, while EndTagEnders are the names of end tags (&lt;/xxx&gt;) that force an end
tag to be generated.
</p>
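<p>The distinction between the two lists can be pictured with a toy model. The sketch
below is not the library's code and does not use its classes; <tt>EnderCheck</tt> and
the tag lists are hypothetical. It only shows the lookup implied by the description
above: a start tag is checked against the enders, an end tag against the end tag enders.
</p>

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/**
 * Toy model only (not the library's code): decides whether a tag seen
 * while a composite tag is still open should force that tag to end.
 */
final class EnderCheck
{
    private final List<String> mEnders;
    private final List<String> mEndTagEnders;

    EnderCheck (String[] enders, String[] endTagEnders)
    {
        mEnders = Arrays.asList (enders);
        mEndTagEnders = Arrays.asList (endTagEnders);
    }

    /**
     * @param name The name of the tag just encountered, e.g. "P".
     * @param isEndTag true for an end tag such as </P>, false for <P>.
     * @return true if the open composite tag should be ended here.
     */
    boolean forcesEnd (String name, boolean isEndTag)
    {
        String upper = name.toUpperCase (Locale.ENGLISH);
        return (isEndTag ? mEndTagEnders.contains (upper) : mEnders.contains (upper));
    }

    public static void main (String[] args)
    {
        // hypothetical, longer lists for a <B> that may never be closed
        EnderCheck bold = new EnderCheck (
            new String[] {"B", "P", "DIV"},
            new String[] {"P", "DIV", "BODY", "HTML"});
        System.out.println (bold.forcesEnd ("p", false)); // start tag <p>
        System.out.println (bold.forcesEnd ("i", true));  // end tag </i>
    }
}
```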
</div>
<a name="quiet"></a>
<div class="section"><h3>How can I block parser messages from appearing on stdout?</h3>
<p>The parser sends warning and error messages to standard output by default.
You might want to block these messages. To achieve this, use a different feedback object:
</p>
<div class="source"><pre>
Parser parser = new Parser ("http://...", new DefaultParserFeedback (DefaultParserFeedback.QUIET));
</pre></div>
<p>The <tt>Parser</tt> class has a static member with just such a construction:
</p>
<div class="source"><pre>
Parser parser = new Parser ("http://...", Parser.DEVNULL);
</pre></div>
<p>You can also switch the feedback to DEBUG mode, to get extra details.
</p>
<div class="source"><pre>
Parser parser = new Parser ("http://...", new DefaultParserFeedback (DefaultParserFeedback.DEBUG));
</pre></div>
<p>
To handle the feedback yourself, implement the <tt>ParserFeedback</tt>
interface and its <tt>info()</tt>, <tt>warning()</tt> and <tt>error()</tt> methods.
</p>
</div>
<a name="empty"></a>
<div class="section"><h3>How does the parser deal with tags like &lt;tag/&gt;?</h3>
<p>
The parser handles tags ending with a slash as a normal <tt>Tag</tt> object.
The <tt>Tag</tt> interface has a method, <tt>isEmptyXmlTag()</tt>, which
returns <tt>true</tt> if the tag is such an empty XML tag (one that has no end tag).
</p>
</div>
<a name="jsp"></a>
<div class="section"><h3>How is JSP parsed using the parser?</h3>
<p>There is a <tt>JspTag</tt> class that handles "%", "%=" and "%@" tags,
<em>but not within tags or remarks</em>.
So, the JSP tag within the tag <tt>&lt;input type='&lt;%= MyType %&gt;'&gt;</tt> would
not be returned as a tag, but would instead be part of the text of the 'type' attribute.
The same tag within the text of the page would be returned as a <tt>JspTag</tt>.
</p>
</div>
<a name="byte"></a>
<div class="section"><h3>How do you find the byte offset from the beginning of a document for a tag?</h3>
<p>Character positions are much easier to obtain than byte positions.
Each tag returned by the parser or lexer has methods <tt>getStartPosition()</tt> and
<tt>getEndPosition()</tt> which return the starting and ending character positions.
</p>
<p>These can be converted to line and column numbers in a hypothetical text file using
<tt>row()</tt> and <tt>column()</tt> methods on the <tt>Page</tt> object:
</p>
<div class="source"><pre>
Page page = parser.getLexer ().getPage ();
int row = page.row (tag.getStartPosition ()); // note: zero based
int column = page.column (tag.getStartPosition ());
</pre></div>
<p>Converting a character position into a byte position is dependent on the character
encoding used. For the ISO-8859-1 encoding, the correspondence is one byte per character,
but for other encodings, often more than one byte is used per character. Perhaps the
only safe way is to write all the characters, up to the character position of interest,
to a suitably encoded writer on a stream, flush the writer and then examine the
byte position of the underlying stream.
</p>
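<p>The approach just described can be sketched with the standard library alone. The
<tt>ByteOffset</tt> class is hypothetical, and the sketch assumes the page text is
available as a <tt>String</tt> and that re-encoding it reproduces the bytes of the
original document.
</p>

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

/**
 * Illustration only: convert a character position to a byte position
 * by encoding every character that precedes it.
 */
final class ByteOffset
{
    /**
     * @param text The document text.
     * @param position The character position of interest.
     * @param encoding The character encoding of the document.
     * @return The number of bytes occupied by the characters before position.
     */
    static int of (String text, int position, String encoding) throws IOException
    {
        ByteArrayOutputStream stream = new ByteArrayOutputStream ();
        Writer writer = new OutputStreamWriter (stream, encoding);
        writer.write (text, 0, position); // characters up to the position
        writer.flush ();                  // push pending bytes to the stream
        return (stream.size ());          // byte position in the underlying stream
    }

    public static void main (String[] args) throws IOException
    {
        String text = "h\u00e9llo"; // second character is an e-acute
        System.out.println (of (text, 2, "ISO-8859-1"));
        System.out.println (of (text, 2, "UTF-8"));
    }
}
```

<p>Note that an encoding which writes a byte order mark, such as Java's <tt>UTF-16</tt>,
includes those bytes in the count.
</p>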
</div>
</div>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">©
2001-2006
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>