Web Document Clustering: A Feasibility Demonstration (1998)

by Oren Zamir, Oren Etzioni
Citations:435 - 3 self

BibTeX

@INPROCEEDINGS{Zamir98webdocument,
    author = {Oren Zamir and Oren Etzioni},
    title = {Web Document Clustering: A Feasibility Demonstration},
    booktitle = {},
    year = {1998},
    pages = {46--54}
}

Abstract

Users of Web search engines are often forced to sift through the long ordered list of document “snippets” returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on the major search engines. The paper articulates the unique requirements of Web document clustering and reports on the first evaluation of clustering methods in this domain. A key requirement is that the methods create their clusters based on the short snippets returned by Web search engines. Surprisingly, we find that clusters based on snippets are almost as good as clusters created using the full text of Web documents. To satisfy the stringent requirements of the Web domain, we introduce an incremental, linear time (in the document collection size) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. We show that STC is faster than standard clustering methods in this domain, and argue that Web document clustering via STC is both feasible and potentially beneficial.
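The abstract's central idea, grouping documents by the phrases they share, can be illustrated with a small sketch. This is not the authors' STC implementation: it enumerates word n-grams directly instead of building a suffix tree, so it is neither incremental nor linear time. The function name `shared_phrase_clusters` and the parameters `max_len` and `min_docs` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def shared_phrase_clusters(snippets, max_len=3, min_docs=2):
    """Map each multi-word phrase to the snippets that contain it.

    Assumption: a "phrase" is a run of 2..max_len consecutive
    lowercase, whitespace-separated words. A real suffix tree would
    find shared phrases of any length far more efficiently.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = " ".join(words[i:i + n])
                phrase_docs[phrase].add(doc_id)
    # Keep only phrases that occur in at least min_docs snippets;
    # each surviving phrase defines one candidate cluster.
    return {
        phrase: sorted(docs)
        for phrase, docs in phrase_docs.items()
        if len(docs) >= min_docs
    }


if __name__ == "__main__":
    demo = [
        "suffix tree clustering of web documents",
        "web documents and suffix tree methods",
        "cooking pasta at home",
    ]
    for phrase, docs in shared_phrase_clusters(demo).items():
        print(phrase, docs)
```

Each returned phrase acts as a candidate cluster label, and the snippet indices are its members. Merging overlapping candidate clusters and scoring them is left out of this sketch.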

Keyphrases

web document clustering    feasibility demonstration    web search engine    web document    document collection size    full text    alternative method    first evaluation    stringent requirement    linear time    unique requirement    document snippet    short snippet    major search engine    long ordered list    suffix tree clustering    web domain    retrieval result    ir community    key requirement
