/usr/share/doc/libhtmlparser-java/html/main.html is in libhtmlparser-java-doc 1.6.20060610.dfsg0-8.
This file is owned by root:root, with mode 0o644.
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Author" content="Derrick Oswald">
<title>HTMLParser Main</title>
<link REL ="stylesheet" TYPE="text/css" HREF="javadoc/stylesheet.css" TITLE="Style">
</head>
<body>
<h1>HTMLParser</h1>
Welcome to the homepage of HTMLParser, a super-fast real-time
parser for real-world HTML. Most developers are drawn to HTMLParser by its
simple design, its speed and its ability to handle streaming real-world
HTML.
<p>The two fundamental use-cases that are handled by the parser are
<a href="#extraction">extraction</a> and <a href="#transformation">transformation</a>
(the synthesis use-case, where HTML pages are created from scratch, is better
handled by other tools closer to the source of data). While prior versions
concentrated on data extraction from web pages, Version 1.4 of the
HTMLParser has substantial improvements in the area of transforming web
pages, with simplified tag creation and editing, and verbatim toHtml() method
output.
<p>In general, to use the HTMLParser you will need to be able to write code in
the Java programming language. Although some example programs are provided
that may be useful as they stand, it's more than likely you will need (or
want) to create your own programs or modify the ones provided to match your
intended application.
<p>To use the library, you will need to add either the htmllexer.jar or
htmlparser.jar to your classpath when compiling and running. The
htmllexer.jar provides low level access to generic string, remark and tag nodes on
the page in a linear, flat, sequential manner. The htmlparser.jar, which
includes the classes found in htmllexer.jar, provides access to a page as a
sequence of nested differentiated tags containing string, remark and other
tag nodes. So where the output from calls to the lexer
<a href="javadoc/org/htmlparser/lexer/Lexer.html#nextNode()">nextNode()</a>
method might be:
<pre>
&lt;html&gt;
&lt;head&gt;
&lt;title&gt;
"Welcome"
&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
etc...
</pre>
The output from the parser <a
href="javadoc/org/htmlparser/util/NodeIterator.html">NodeIterator</a> would
nest the tags as children of the &lt;html&gt;, &lt;head&gt; and other nodes
(here represented by indentation):
<pre>
&lt;html&gt;
    &lt;head&gt;
        &lt;title&gt;
            "Welcome"
        &lt;/title&gt;
    &lt;/head&gt;
    &lt;body&gt;
        etc...
</pre>
The parser attempts to balance opening tags with ending tags to present the
structure of the page, while the lexer simply spits out nodes. If your
application requires only modest structural knowledge of the page, and is
primarily concerned with individual, isolated nodes, you should consider
using the lightweight lexer. But if your application requires knowledge of
the nested structure of the page, for example processing tables, you will
probably want to use the full parser.
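<p>The difference between the flat and the nested view can be sketched with
a small, self-contained toy in plain Java. It does not use the HTMLParser API:
the class FlatVsNested and its methods tokenize() and nest() are hypothetical
names invented for this illustration, and the tokenizer ignores remarks,
attributes containing '&gt;', unbalanced tags and void elements such as br,
all of which a real lexer and parser have to handle.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only; not part of HTMLParser.
class FlatVsNested {

    /** Toy "lexer": splits markup into a flat, sequential list of tag and text tokens. */
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int end;
            if (html.charAt(i) == '<') {
                // A tag runs up to and including the next '>'.
                int close = html.indexOf('>', i);
                end = (close < 0) ? html.length() : close + 1;
            } else {
                // Text runs up to, but not including, the next '<'.
                int open = html.indexOf('<', i);
                end = (open < 0) ? html.length() : open;
            }
            String token = html.substring(i, end).trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
            i = end;
        }
        return tokens;
    }

    /** Toy "parser": balances opening and closing tags, indenting each token by its depth. */
    static String nest(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (String token : tokens) {
            if (token.startsWith("</")) {
                depth = Math.max(0, depth - 1);   // a closing tag ends one level
            }
            for (int d = 0; d < depth; d++) {
                out.append("  ");
            }
            out.append(token).append('\n');
            if (token.startsWith("<") && !token.startsWith("</")) {
                depth++;                          // an opening tag starts a new level
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("<html><head><title>Welcome</title></head></html>");
        tokens.forEach(System.out::println);      // flat view, one token per line
        System.out.print(nest(tokens));           // nested view, shown by indentation
    }
}
```

<p>The flat list is enough when each node can be handled in isolation. The
nested form is what makes questions such as "which cells belong to this table
row" answerable, which is why table processing calls for the full parser.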
<h2><a name=extraction>Extraction</a></h2>
Extraction encompasses all the information retrieval programs that are not
meant to preserve the source page. This covers uses like:
<ul>
<li>text extraction, for example as input to text search engine databases</li>
<li>link extraction, for crawling through web pages or harvesting email
addresses</li>
<li>screen scraping, for programmatic data input from web pages</li>
<li>resource extraction, collecting images or sound</li>
<li>a browser front end, the preliminary stage of page display</li>
<li>link checking, ensuring links are valid</li>
<li>site monitoring, checking for page differences beyond simplistic diffs</li>
</ul>
There are several facilities in the HTMLParser codebase to help with
extraction, including
<a href="javadoc/org/htmlparser/filters/package-summary.html">filters</a>,
<a href="javadoc/org/htmlparser/visitors/package-summary.html">visitors</a> and
<a href="javadoc/org/htmlparser/beans/package-summary.html">JavaBeans</a>.
<h2><a name=transformation>Transformation</a></h2>
Transformation includes all processing where the input <em>and</em> the output
are HTML pages. Some examples are:
<ul>
<li>URL rewriting, modifying some or all links on a page</li>
<li>site capture, moving content from the web to local disk</li>
<li>censorship, removing offending words and phrases from pages</li>
<li>HTML cleanup, correcting erroneous pages</li>
<li>ad removal, excising URLs referencing advertising</li>
<li>conversion to XML, moving existing web pages to XML</li>
</ul>
Transformation can occur 'on the fly' when using
<a href="javadoc/org/htmlparser/tags/package-summary.html">custom tags</a>
in conjunction with the
<a href="javadoc/org/htmlparser/PrototypicalNodeFactory.html">PrototypicalNodeFactory</a>.
Alternatively, transformation can occur on a list of nodes after extraction using one or
more <a href="javadoc/org/htmlparser/visitors/package-summary.html">visitors</a>.
In either case you will need to output the NodeList returned by the parse()
method with the <a href="javadoc/org/htmlparser/util/NodeList.html#toHtml()">toHtml()</a>
method.
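<p>As a rough sketch of the visitor route, the following class rewrites link
targets and then writes the page back out with toHtml(). It assumes
htmlparser.jar is on the classpath. The class name RewriteLinks, the method
rewrite(), the charset and the example URLs are hypothetical choices made for
this illustration, and the exact behaviour of each call should be checked
against the javadoc of the version in use.

```java
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.NodeVisitor;

// Illustrative sketch; names and URLs are assumptions, not part of the library.
class RewriteLinks {

    /** Parses the markup, rewrites link targets that start with oldPrefix, returns HTML. */
    static String rewrite(String html, final String oldPrefix, final String newPrefix)
            throws ParserException {
        Parser parser = Parser.createParser(html, "ISO-8859-1");
        NodeList nodes = parser.parse(null);   // null filter: keep all top level nodes

        // The visitor descends into child nodes and sees every tag on the page.
        nodes.visitAllNodesWith(new NodeVisitor() {
            public void visitTag(Tag tag) {
                if (tag instanceof LinkTag) {
                    LinkTag link = (LinkTag) tag;
                    String target = link.getLink();
                    if (target != null && target.startsWith(oldPrefix)) {
                        link.setLink(newPrefix + target.substring(oldPrefix.length()));
                    }
                }
            }
        });

        // Transformation output is HTML again, produced from the edited node list.
        return nodes.toHtml();
    }

    public static void main(String[] args) throws ParserException {
        String page = "<html><body><a href=\"http://old.example/page\">a link</a></body></html>";
        System.out.println(rewrite(page, "http://old.example", "http://new.example"));
    }
}
```

<p>Editing the nodes in place and serializing the whole list afterwards keeps
the rest of the page as it was parsed, so only the rewritten links differ from
the input.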
<p>HTMLParser is an open source library released under the
<a href="http://www.opensource.org/licenses/lgpl-license.html">GNU Lesser General Public
License</a>, which basically says you are free to use the library "as is" in
other (even proprietary) products, as long as due credit is given to the authors
and the source code for the HTMLParser is included or available with the other product.
For modified or embedded use, please consult the
<a href="http://www.opensource.org/licenses/lgpl-license.html">LGPL license</a>.
<div align="right">
<a href="http://sourceforge.net/projects/htmlparser" target="_parent">
<img src="" width="88" height="31" border="0" alt="SourceForge.net">
</a>
</div>
</body>
</html>