Using URLs and Table Layout for Web Classification Tasks (2004)
Lawrence Kai Shih and David R. Karger Massachusetts Institute of Technology...



View or download:
www2004.org/proceedings/doc...1p193.pdf

From:  www2004.org/proceeding...contents

Abstract: We propose new features and algorithms for automating Web-page classification tasks such as content recommendation and ad blocking. We show that the automated classification of Web pages can be much improved if, instead of looking at their textual content, we consider each link's URL and the visual placement of those links on a referring page. These features are unusual: rather than being scalar measurements like word counts, they are tree structured, describing the position of the item in a...
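
The abstract describes URL features as tree structured rather than scalar. As a rough illustration only, the sketch below turns a URL into a root-to-leaf token path and measures how deep two such paths agree. The tokenization (reversed host labels followed by path segments) and the function names `url_to_tree_path` and `shared_depth` are assumptions made for this example; the truncated abstract does not specify how the authors build their trees.

```python
from urllib.parse import urlsplit


def url_to_tree_path(url):
    """Return a root-to-leaf list of tokens for a URL (illustrative assumption)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Reverse the host labels so the most general one (e.g. "com") is the root.
    host_tokens = [t for t in reversed(host.split(".")) if t]
    # Path segments follow the host, from the top directory down to the file.
    path_tokens = [t for t in parts.path.split("/") if t]
    return host_tokens + path_tokens


def shared_depth(path_a, path_b):
    """Count how many leading tokens two tree paths have in common."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth


if __name__ == "__main__":
    a = url_to_tree_path("http://www.example.com/news/sports/story.html")
    b = url_to_tree_path("http://www.example.com/news/politics/a.html")
    print(a)
    print(shared_depth(a, b))
```

Under this assumed encoding, two links that share a long common prefix sit close together in the tree, which is one way a classifier could treat them as likely to belong to the same class.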

Cited by:
Thresher: Automating the Unwrapping of Semantic - Content From The (2005)
Thresher: Automating the Unwrapping of Semantic Content from.. - Hogue, Karger (2005)

Active bibliography (related documents):
0.7:   Classification Techniques for Categorization of Hypertext Documents - Arumugam
0.3:   Sentence Extraction by tf/idf and Position Weighting from.. - Seki
0.3:   Sentence Alignment for Monolingual Comparable Corpora - Regina Barzilay Cornell (2003)

Similar documents based on text:
0.2:   Belief Layer for Haystack - Zhurakhinskaya (2002)
0.1:   An Experimental Study of Poly-Logarithmic.. - Iyer, Jr., Karger.. (2000)
0.1:   Experimental Study of Minimum Cut Algorithms - Chekuri, Goldberg, Karger.. (1997)

Related documents from co-citation:
3:   Annotea: An Open RDF Infrastructure for Shared Web Annotations - Kahan, Koivunen et al. - 2001
3:   Wrapper induction for information extraction - Kushmerick, Weld et al. - 1997
3:   New Tools for the Semantic Web - Golbeck, Grove et al. - 2002

BibTeX entry:

L. K. Shih and D. Karger. Using URLs and table layout for web classification tasks. In Proceedings of the 13th International Conference on the World Wide Web, pages 193--202, New York, NY, 2004. http://citeseer.ist.psu.edu/shih04using.html

@inproceedings{shih04using,
  author = "L. K. Shih and D. Karger",
  title = "Using {URLs} and table layout for web classification tasks",
  booktitle = "Proceedings of the 13th International Conference on the World Wide Web",
  pages = "193--202",
  address = "New York, NY",
  year = "2004",
  url = "http://citeseer.ist.psu.edu/shih04using.html" }

Citations (may not include all citations):
641   The anatomy of a large-scale hypertextual Web search engine - Brin, Page - 1998
431   A tutorial on support vector machines for pattern recognitio.. - Burges - 1998
228   Wrapper induction for information extraction - Kushmerick, Weld et al. - 1997
149   Quantifying inductive bias: AI learning algorithms and Valia.. - Haussler - 1988
135   Hierarchically classifying documents using very few words - Koller, Sahami - 1997
130   A probabilistic analysis of the rocchio algorithm with tfidf.. - Joachims - 1997
91   Learning and revising user profiles: The identification of i.. - Pazzani, Billsus - 1997
65   On integrating catalogs - Agrawal, Srikant - 2001
61   Improving text classification by shrinkage in a hierarchy of.. - McCallum, Rosenfeld et al. - 1998
49   Wiley and Sons - Duda, Hart et al. - 1973
42   Nonparametric Statistical Methods - Hollander, Wolfe - 1973
41   Using reinforcement learning to spider the Web efficiently - Rennie, McCallum - 1999
26   A hybrid user model for news story classification - Billsus, Pazzani - 1999
19   generative classifiers: A comparison of logistic regression .. - Ng, Jordan et al. - 2002
15   Accelerated focused crawling through online relevance feedba.. - Chakrabarti, Punera et al. - 2002
13   Web montage: a dynamic personalized start page - Anderson, Horvitz - 2002
12   Learning to remove internet advertisement - Kushmerick - 1999
9   httpfive percent nation - http, nation et al. - 2000
3   Inferring strategies for sentence ordering in multidocument .. - Barzilay, Elhadad et al. - 2002
1   Massachusetts Institute of Technology - Shih, of et al. - 2004

Documents on the same site (http://www.www2004.org/proceedings/docs/contents.htm):
A Community-Aware Search Engine - Almeida, Almeida (2004)
HuskySim: A Simulation Toolkit for Application Scheduling.. - Kerasha, Greenshields (2004)
Reactive Rules Inference from Dynamic Dependency Models - Adi, Etzion, Gilat.. (2004)

CiteSeer.IST - Copyright Penn State and NEC