D. Raggett. Clean Up Your Web Pages with HTML TIDY. http://www.w3.org/People/Raggett/tidy/, 1999.
....identified by the given URL, and fetches the corresponding web document (the so-called page object). XWRAP uses this page object as a sample when interacting with the user to learn and derive the important information extraction rules. Second, it cleans up bad HTML tags and syntactical errors [15, 18]. Third, it transforms the retrieved page object into a parse tree, the so-called syntactic token tree. Information Extraction is the second component, which is responsible for deriving extraction rules that use declarative specification to describe how to extract information content of interest ....
....repairs end tags in the wrong order or illegal nesting of elements. We describe each type of HTML error in a normalization rule. The same set of normalization rules can be applied to all HTML documents. Our HTML syntax error repair module can clean up most of the errors listed in HTML TIDY [27, 30].

3.3 Generating a Syntactic Token Tree

Once the HTML errors and bad formatting are repaired, the clean HTML document is fed to a source-language-compliant tree parser, which parses the block character by character, carving the source document into a sequence of atomic units, called syntactic ....