Attardi, G., Gullì, A., Sebastiani, F., Automatic Web Page Categorization by Link and Context Analysis, THAI 1999.
.... (or citations) are being actively used to improve web search engine ranking [4], improve web crawlers [6], discover web communities [8], organize search results into hubs and authorities [13], make predictions about similarity between research papers [16], and even to classify target web pages [20, 9, 2, 5, 3]. The basic assumption made by citation or link analysis is that a link is often created because of a subjective connection between the original document and the cited, or linked-to, document. For example, if I am making a web page about my hobbies, and I like playing Scrabble, I might link to an ....
....extended anchortexts, we replaced actual downloaded documents with virtual documents. We define a virtual document as a collection of anchortexts or extended anchortexts from links pointing to the target document. Our definition is similar to the concept of blurbs described by Attardi et al. [2]. This is similar to what was done by Fürnkranz [9]. Anchortext refers to the words occurring inside of a link as shown in Figure 1. We define extended anchortext as the set of rendered words occurring up to 25 words before and after an associated link (as well as the anchortext itself). Figure 1 ....
Giuseppe Attardi, Antonio Gullì, and Fabrizio Sebastiani. Automatic Web page categorization by link and context analysis. In Chris Hutchison and Gaetano Lanzarone, editors, Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, pages 105--119, Varese, IT, 1999.
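The definition of extended anchortext quoted above can be illustrated with a minimal Python sketch. It is not the citing authors' implementation: the names `LinkContextParser`, `extended_anchortexts` and `virtual_document` are hypothetical, "rendered words" are approximated by whitespace-separated text nodes, and links are matched on an exact `href` string.

```python
from html.parser import HTMLParser

WINDOW = 25  # words kept before and after each link, as in the excerpt above


class LinkContextParser(HTMLParser):
    """Collects text words and the word spans covered by links to one target URL.

    Assumption: "rendered words" are approximated by splitting every text node
    on whitespace; no attempt is made to skip script or style content.
    """

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.words = []          # all words, in document order
        self.spans = []          # (start, end) word indices of each matching anchor
        self._open_start = None  # word index where the current matching anchor began

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self._open_start = len(self.words)

    def handle_endtag(self, tag):
        if tag == "a" and self._open_start is not None:
            self.spans.append((self._open_start, len(self.words)))
            self._open_start = None

    def handle_data(self, data):
        self.words.extend(data.split())


def extended_anchortexts(html, target_url, window=WINDOW):
    """Return one extended anchortext per link to target_url found in html."""
    parser = LinkContextParser(target_url)
    parser.feed(html)
    parser.close()
    return [
        " ".join(parser.words[max(0, start - window):end + window])
        for start, end in parser.spans
    ]


def virtual_document(pages, target_url, window=WINDOW):
    """Join the extended anchortexts from every linking page into one text."""
    snippets = []
    for html in pages:
        snippets.extend(extended_anchortexts(html, target_url, window))
    return "\n".join(snippets)
```

Setting `window=0` reduces each snippet to the plain anchortext, so the same helper covers both kinds of virtual document mentioned in the excerpt.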
....a common assumption made when applying a learning approach is that items to learn from and items to be classified are of the same type. The approach we describe here violates this assumption, but it is forced by the requirement stated above. Also, this approach has been followed by Attardi et al. [2] when classifying web pages, where the documents to be classified are artificially constructed using the text surrounding the hyperlinks to the pages to be categorized. 4.3 Machine Learning Linear Classifiers For training a text categorization system, a number of Machine Learning approaches have ....
Attardi, G., Gullì, A., and Sebastiani, F. (1999) Automatic Web Page Categorization by Link and Context Analysis. In Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, pp. 105-119.
....we motivated why distributed ontologies are necessary for community support. We showed that a mapping between ontologies is required. In this section we will motivate the mapping further. We will also present a mapping between ontologies based on a web page categorization technique introduced in [23]. Working with personal ontologies is convenient for the user. If the user finds a new document of interest for her, she classifies it according to her personal ontology for later retrieval. However, the user is also a member of a community, the members of which have shared interests with the ....
....global, other user) by automatic text categorization. The only assumption that is made is that the target ontology and especially natural language identifiers for the concepts have to exist. 5.2 Web page categorization by context Our text categorization is based on a concept presented in [23], where web documents are classified by context instead of content. For categorization of a page p, information from pages referring to p by hyperlinks is used. This method of automatic categorization is based on the following assumptions: a web page which refers to a page p contains ....
G. Attardi, A. Gullì, and F. Sebastiani, "Automatic web page categorization by link and context analysis," in THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, C. Hutchison and G. Lanzarone, Eds., 1999, pp. 105--119.
.... for searching heterogeneous information sources that leverages the use of metadata [17] and automated classification [18]. The hyperlink structure of the web can be exploited for automated classification by using the anchor text and other context from linking documents as a source of text features [19]. Approaches to efficient web spidering [20, 21] have been investigated and are especially important for very large scale crawling efforts. A complete system for automatically building searchable databases of domain specific web resources using a combination of techniques such as automated ....
G. Attardi, A. Gulli, and F. Sebastiani. Automatic Web Page Categorization by Link and Context Analysis. In Chris Hutchison and Gaetano Lanzarone (eds.), Proceedings of THAI'99, European Symposium on Telematics, Hypermedia and Artificial Intelligence, 105-119, 1999.
.... Galleries, and Centers Arts 75 500 100 300 Management Consulting Consulting 300 500 100 300 Table 1: Yahoo categories used to test classification accuracy, numbers are positive negative Yahoo Category Full Text Anchortext Extended AT Combined Sampled Sampled Biology 51.3 90 55.1 97.3 72.9 98 80.4 97.3 83.1 98 9.8 Archaeology 65.5 92.7 72.2 98.3 83.2 99.2 91.6 98.4 94.4 99.2 8.7 Wildlife 83.3 97.3 76.7 99 87.1 99 96.6 99 96.6 99 4.6 Museums 57 93.7 80 98 87 98.7 89 98.3 94 98.7 6.3 Mgmt Consulting 74 88.7 56.7 95 81.1 95 88.9 92.3 92.2 95 9.5 Average 66.2 92.5 68.3 97.5 ....
....Consulting Consulting 300 500 100 300 Table 1: Yahoo categories used to test classification accuracy, numbers are positive negative Yahoo Category Full Text Anchortext Extended AT Combined Sampled Sampled Biology 51.3 90 55.1 97.3 72.9 98 80.4 97.3 83.1 98 9.8 Archaeology 65.5 92.7 72.2 98.3 83.2 99.2 91.6 98.4 94.4 99.2 8.7 Wildlife 83.3 97.3 76.7 99 87.1 99 96.6 99 96.6 99 4.6 Museums 57 93.7 80 98 87 98.7 89 98.3 94 98.7 6.3 Mgmt Consulting 74 88.7 56.7 95 81.1 95 88.9 92.3 92.2 95 9.5 Average 66.2 92.5 68.3 97.5 82.2 98 89.3 97.1 92.1 98.0 7.7 Table 2: Percentage ....
[Article contains additional citation context not shown here]
G. Attardi, A. Gullì, and F. Sebastiani. Automatic Web page categorization by link and context analysis. In C. Hutchison and G. Lanzarone, editors, Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, pages 105-119, Varese, IT, 1999.
....traditional method based on keyword frequency analysis cannot be used for web documents. The link based approach is an automatic web page categorization technique based on the fact that a web page that refers to a document must contain enough hints about its content to induce someone to read it [13]. Such hints can be used to classify the document being referred to, as has been done according to [13]. We observe that the methods used so far are based to a great extent on the textual information contained in the page. We now present our approach based on structure of the page and image and ....
.... approach is an automatic web page categorization technique based on the fact that a web page that refers to a document must contain enough hints about its content to induce someone to read it [13]. Such hints can be used to classify the document being referred to, as has been done according to [13]. We observe that the methods used so far are based to a great extent on the textual information contained in the page. We now present our approach based on structure of the page and image and multimedia content in the page. 3. Structure based approach Structure based approach relies on the ....
Giuseppe Attardi, Antonio Gulli, Fabrizio Sebastiani, Automatic Web Page Categorization by Link and Context Analysis.
....a common assumption made when applying a learning approach is that items to learn from and items to be classified are of the same type. The approach we describe here violates this assumption, but it is forced by the requirement stated above. Also, this approach has been followed by Attardi et al. [3] when classifying web pages, where the documents to be classified are artificially constructed using the text surrounding the hyperlinks to the pages to be categorized. 3.1.2 Machine Learning Linear Classifiers For training a text categorization system, a number of Machine Learning approaches ....
Attardi, G., Gullì, A., and Sebastiani, F. (1999) Automatic Web Page Categorization by Link and Context Analysis. In Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, pp. 105-119.
....of too many matches. The search covers the whole Internet and thus has not just access to much more data than what we assume, but user queries are much more varied and thus domain independent than searches in intranets or on single Web sites. Automatic classification of Web pages is the focus of (Attardi et al. 1999). They introduce categorization by context which exploits information surrounding the links pointing to a document in order to classify that document. It is seen in contrast to categorization by content which relies on textual information found in a document. The basic assumptions are: 1) a Web ....
G. Attardi, A. Gullì, and F. Sebastiani. 1999. Automatic Web page categorization by link and context analysis. In C. Hutchison and G. Lanzarone, editors, Proceedings of THAI-99, European Symposium on Telematics, Hypermedia and Artificial Intelligence, pages 105--119, Varese, Italy.
.... for searching heterogeneous information sources that leverages the use of metadata [17] and automated classification [18]. The hyperlink structure of the web can be exploited for automated classification by using the anchor text and other context from linking documents as a source of text features [19]. Approaches to efficient web spidering [20, 21] have been investigated and are especially important for very large scale crawling efforts. A complete system for automatically building searchable databases of domain specific web resources using a combination of techniques such as automated ....
G. Attardi, A. Gulli, and F. Sebastiani. Automatic Web Page Categorization by Link and Context Analysis. In Chris Hutchison and Gaetano Lanzarone (eds.), Proceedings of THAI'99, European Symposium on Telematics, Hypermedia and Artificial Intelligence, 105-119, 1999.
.... for searching heterogeneous information sources that leverages the use of metadata [17] and automated classification [18]. The hyperlink structure of the web can be exploited for automated classification by using the anchor text and other context from linking documents as a source of text features [19]. Approaches to efficient web spidering [20, 21] have been investigated and are especially important for very large scale crawling efforts. A complete system for automatically building searchable databases of domain specific web resources using a John M. Pierre: Practical Issues for Automated ....
G. Attardi, A. Gulli, and F. Sebastiani. Automatic Web Page Categorization by Link and Context Analysis. In Chris Hutchison and Gaetano Lanzarone (eds.), Proceedings of THAI'99, European Symposium on Telematics, Hypermedia and Artificial Intelligence, 105-119, 1999.
.... for searching heterogeneous information sources that leverages the use of metadata [17] and automated classification [18]. The hyperlink structure of the web can be exploited for automated classification by using the anchor text and other context from linking documents as a source of text features [19]. Approaches to efficient web spidering [20, 21] have been investigated and are especially important for very large scale crawling efforts. A complete system for automatically building searchable databases of domain specific web resources using a combination of techniques such as automated ....
G. Attardi, A. Gulli, and F. Sebastiani. Automatic Web Page Categorization by Link and Context Analysis. In Chris Hutchison and Gaetano Lanzarone (eds.), Proceedings of THAI'99, European Symposium on Telematics, Hypermedia and Artificial Intelligence, 105-119, 1999.
....were unsatisfactory. This was due to the fact that the webpages the bookmarks pointed to were mostly only title pages with links to a number of additional pages that held the actual document. We are planning to implement special web page classification algorithms such as the one presented in (Attardi, Gullì, Sebastiani 1999) as future work. 3 http://www.dmoz.org For now, our testbed consists of user bookmarks which link to documents in PDF or PS format as the user repository. For the community repository, we have two testbeds: the RESEARCHINDEX 4 and a self implemented community repository, managed by the ....
Attardi, G.; Gullì, A.; and Sebastiani, F. 1999. Automatic web page categorization by link and context analysis. In Hutchison, C., and Lanzarone, G., eds., THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, pp. 105--119.
....# k 2 documents, and one in which category; 8 tered categorisation may be aptest so as to allow new categories to be added and obsolete ones to be deleted. The automatic categorisation of Web pages or sites into Yahoo-like hierarchical catalogues is discussed in several recent papers (see e.g. [Attardi et al. 1999; Baker and McCallum 1998; Chakrabarti et al. 1998; McCallum et al. 1998; Mladenic 1998b]) and will be more extensively discussed in Section 9. 4. THE MACHINE LEARNING APPROACH TO TEXT CATEGORISATION In the 80s the main approach used to the realisation of automatic document classifiers ....
....selection, classifier induction and evaluation. One of the reasons this application has given rise to specific techniques is that Web pages are special kinds of documents, as they consist not only of a text but also of a set of incoming and outgoing pointers. 9.1 Indexing and dimensionality reduction [Attardi et al. 1999] propose an indexing technique specific to Web documents which is based on the notion of the blurb of a document (see Figure 7). Given a test document d_j, blurb(d_j) is another artificial .... [Fig. 7. The blurb of document d.]
Attardi, G., Gullì, A., and Sebastiani, F. 1999. Automatic Web page categorization by link and context analysis. In C. Hutchison and G. Lanzarone, Eds., Proceedings of THAI-99, European Symposium on Telematics, Hypermedia and Artificial Intelligence (Varese, IT, 1999), pp. 105--119.
....the structure of Web documents that refer to them. The overall architecture of the task is described in Figure 1; the subtasks, to be carried out in sequence, are spidering Web documents, HTML structure analysis, URL categorization, weight combination and catalogue update. See the full paper [Attardi 99a] for a detailed description of the adopted algorithm. 4.1 Spidering and HTML Structure Analysis This task starts from a list of URLs, retrieving the documents referred by each of them and analyzing the structure of the document expressed in terms of its HTML tags (for an introduction to HTML ....
Attardi, G., Gullì, A., Sebastiani, F.: "Automatic Web Page Categorization by Link and Context Analysis". Manuscript.
....matching, query expansion, are typically used in the inductive construction of the classifiers; 3. IR style evaluation of the effectiveness of the classifiers is performed. The various approaches to classification differ mostly for how they tackle Step 2, although in a few cases (e.g. [2]) non standard approaches to Step 1 are also used. Steps 1, 2 and 3 will be the main themes of Sections 4, 5 and 7, respectively. 4 Indexing and dimensionality reduction In true information retrieval style, each document (either belonging to the initial corpus, or to be categorised in the ....
G. Attardi, A. Gullì, and F. Sebastiani. Automatic Web page categorization by link and context analysis. In C. Hutchison and G. Lanzarone, editors, Proceedings of THAI'99, European Symposium on Telematics, Hypermedia and Artificial Intelligence, pages 105--119, Varese, IT, 1999.
No context found.
Attardi, G., Gullì, A., Sebastiani, F., Automatic Web Page Categorization by Link and Context Analysis, THAI 1999.
No context found.
Attardi, G., Gullì, A., Sebastiani, F.: Automatic Web page categorization by link and context analysis. In Hutchison, C., Lanzarone, G., eds.: Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, (Varese, IT) 12
No context found.
G. Attardi, A. Gullì, and F. Sebastiani. Automatic Web page categorization by link and context analysis. In C. Hutchison and G. Lanzarone, editors, Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, pages 105--119, Varese, IT, 1999.
No context found.
G. Attardi, A. Gullì, and F. Sebastiani. Automatic Web page categorization by link and context analysis. In Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, 1999.
No context found.
G. Attardi, A. Gullì, & F. Sebastiani, "Automatic web page categorization by link and context analysis," in THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, C. Hutchison and G. Lanzarone, Eds., 1999, pp. 105--119.
No context found.
G. Attardi, A. Gullì, and F. Sebastiani. Automatic Web page categorization by link and context analysis. In Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, 1999.
No context found.
G. Attardi, A. Gullì, and F. Sebastiani. Automatic Web page categorization by link and context analysis. In Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, 1999.
No context found.
Giuseppe Attardi, Antonio Gullì, and Fabrizio Sebastiani. Automatic Web page categorization by link and context analysis. In Proceedings of European Symposium on Telematics, Hypermedia and Artificial Intelligence, pages 105--119, Varese, IT, 1999.
No context found.
G. Attardi, A. Gullì, and F. Sebastiani. Automatic Web page categorization by link and context analysis. In Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, 1999.
No context found.
Giuseppe Attardi, Antonio Gullì, and Fabrizio Sebastiani. Automatic Web page categorization by link and context analysis. In Chris Hutchison and Gaetano Lanzarone, editors, Proceedings of THAI'99, European Symposium on Telematics, Hypermedia and Artificial Intelligence, pages 105--119, Varese, IT, 1999.