Results 1 - 10 of 637
Efficient Identification of Web Communities
- In Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2000
"... We define a community on the web as a set of sites that have more links (in either direction) to members of the community than to non-members. Members of such a community can be eciently identified in a maximum flow / minimum cut framework, where the source is composed of known members, and the sink ..."
Abstract - Cited by 293 (13 self)
We define a community on the web as a set of sites that have more links (in either direction) to members of the community than to non-members. Members of such a community can be efficiently identified in a maximum flow / minimum cut framework, where the source is composed of known members, and the sink consists of well-known non-members. A focused crawler that crawls to a fixed depth can approximate community membership by augmenting the graph induced by the crawl with links to a virtual sink node. The effectiveness of the approximation algorithm is demonstrated with several crawl results that identify hubs, authorities, web rings, and other link topologies that are useful but not easily categorized. Applications of our approach include focused crawlers and search engines, automatic population of portal categories, and improved filtering.
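
A minimal sketch of the max-flow / minimum-cut formulation described above, assuming the networkx library; the toy link graph, node names, and capacities are illustrative and not taken from the paper:

    # Sketch: a web community as the source side of a minimum cut, following the
    # max-flow formulation above. The link graph, seed choices, and capacities
    # are toy values for illustration only.
    import networkx as nx

    G = nx.DiGraph()
    links = [("seed", "a"), ("a", "seed"), ("a", "b"), ("b", "a"),
             ("seed", "b"), ("b", "outside"), ("outside", "far")]
    for u, v in links:
        G.add_edge(u, v, capacity=1)

    BIG = 1000  # effectively infinite capacity for the virtual terminals
    G.add_edge("SOURCE", "seed", capacity=BIG)   # known member
    G.add_edge("far", "SINK", capacity=BIG)      # known non-member

    cut_value, (source_side, sink_side) = nx.minimum_cut(G, "SOURCE", "SINK")
    print(sorted(source_side - {"SOURCE"}))      # community members: ['a', 'b', 'seed']

The pages left on the source side of the minimum cut are taken to be the community members.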
The Evolution of the Web and Implications for an Incremental Crawler
, 1999
"... In this paper we study how to build an effective incremental crawler. The crawler selectively and incrementally updates its index and/or local collection of web pages, instead of periodically refreshing the collection in batch mode. The incremental crawler can improve the "freshness" of th ..."
Abstract - Cited by 281 (18 self)
In this paper we study how to build an effective incremental crawler. The crawler selectively and incrementally updates its index and/or local collection of web pages, instead of periodically refreshing the collection in batch mode. The incremental crawler can improve the "freshness" of the collection significantly and bring in new pages in a more timely manner. We first present results from an experiment conducted on more than half a million web pages over 4 months, to estimate how web pages evolve over time. Based on these experimental results, we compare various design choices for an incremental crawler and discuss their trade-offs. We propose an architecture for the incremental crawler, which combines the best design choices.
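
As a rough illustration of the selective-update idea (this is not the paper's architecture; the change-rate estimates and priority rule are invented for the example), an incremental crawler can order revisits by the number of changes it expects to have missed:

    # Sketch: selective, incremental refresh. Pages believed to change often and
    # not fetched recently are refreshed first, instead of re-crawling the whole
    # collection in batch. Change rates and timestamps are illustrative.
    import heapq, time

    def expected_missed_changes(change_rate_per_day, last_fetch_ts, now):
        days_stale = (now - last_fetch_ts) / 86400.0
        return change_rate_per_day * days_stale

    now = time.time()
    pages = {
        "http://news.example.com/":        (5.0,  now - 1 * 86400),
        "http://static.example.org/about": (0.01, now - 30 * 86400),
        "http://blog.example.net/":        (0.5,  now - 10 * 86400),
    }

    heap = [(-expected_missed_changes(rate, ts, now), url)
            for url, (rate, ts) in pages.items()]
    heapq.heapify(heap)
    while heap:
        neg_score, url = heapq.heappop(heap)
        print(f"refresh {url}  (expected missed changes: {-neg_score:.2f})")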
Crawling the Hidden Web
- In VLDB
, 2001
"... Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. ..."
Abstract - Cited by 279 (2 self)
Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration.
Focused crawling using context graphs
- In 26th International Conference on Very Large Databases, VLDB 2000
, 2000
"... diligmic,gori¢ Maintaining currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size and dynamic content of the web. Focused crawlers aim to search only the subset of the web related to a specific category, and offer a potential solution to th ..."
Abstract - Cited by 255 (11 self)
Maintaining currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size and dynamic content of the web. Focused crawlers aim to search only the subset of the web related to a specific category, and offer a potential solution to the currency problem. The major problem in focused crawling is performing appropriate credit assignment to different documents along a crawl path, such that short-term gains are not pursued at the expense of less-obvious crawl paths that ultimately yield larger sets of valuable pages. To address this problem we present a focused crawling algorithm that builds a model for the context within which topically relevant pages occur on the web. This context model can capture typical link hierarchies within which valuable pages occur, as well as model content on documents that frequently co-occur with relevant pages. Our algorithm further leverages the existing capability of large search engines to provide partial reverse crawling capabilities. Our algorithm shows significant performance improvements in crawling efficiency over standard focused crawling.
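
A toy sketch of the crawl loop such a context model implies (the layer classifier and scoring here are stand-ins, not the authors' code): pages predicted to lie a few links upstream of the target topic stay in the frontier with credit reflecting their estimated distance:

    # Sketch: context-graph style focused crawling over a toy corpus. A stand-in
    # classifier guesses which "layer" of the context graph a page belongs to
    # (0 = on-topic, larger = further upstream); the frontier is a priority queue
    # so promising upstream pages are kept rather than discarded.
    import heapq

    PAGES = {  # url -> (page text, outlinks); stands in for real fetching
        "seedA":  ("conference listing page", ["dept", "spam"]),
        "dept":   ("department publications page", ["paper1"]),
        "spam":   ("unrelated page", []),
        "paper1": ("target-topic keyword: focused crawling paper", []),
    }

    def classify_layer(text):
        # stand-in for the per-layer classifiers trained from backlink crawls
        if "target-topic keyword" in text:
            return 0
        if "publications" in text or "conference" in text:
            return 1      # pages that typically sit one link from on-topic pages
        return 3

    def crawl(seeds, max_pages=10):
        frontier = [(1, url) for url in seeds]   # (priority = predicted distance, url)
        heapq.heapify(frontier)
        seen, relevant = set(seeds), []
        while frontier and len(seen) <= max_pages:
            _, url = heapq.heappop(frontier)
            text, outlinks = PAGES.get(url, ("", []))
            layer = classify_layer(text)
            if layer == 0:
                relevant.append(url)
            for link in outlinks:
                if link not in seen:
                    seen.add(link)
                    # credit assignment: a child's priority reflects how close its
                    # parent looked to the target, so upstream paths survive
                    heapq.heappush(frontier, (layer, link))
        return relevant

    print(crawl(["seedA"]))   # -> ['paper1']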
Scaling Question Answering to the Web
, 2001
"... The wealth of information on the web makes it an attractive resource for seeking quick answers to simple, factual questions such as "who was the first American in space?" or "what is the second tallest mountain in the world?" Yet today's most advanced web search services (e. ..."
Abstract - Cited by 238 (16 self)
The wealth of information on the web makes it an attractive resource for seeking quick answers to simple, factual questions such as "who was the first American in space?" or "what is the second tallest mountain in the world?" Yet today's most advanced web search services (e.g., Google and AskJeeves) make it surprisingly tedious to locate answers to such questions. In this paper, we extend question-answering techniques, first studied in the information retrieval literature, to the web and experimentally evaluate their performance. First we introduce MULDER, which we believe to be the first general-purpose, fully-automated question-answering system available on the web. Second, we describe MULDER's architecture, which relies on multiple search-engine queries, natural-language parsing, and a novel voting procedure to yield reliable answers coupled with high recall. Finally, we compare MULDER's performance to that of Google and AskJeeves on questions drawn from the TREC-8 question track. We find that MULDER's recall is more than a factor of three higher than that of AskJeeves. In addition, we find that Google requires 6.6 times as much user effort to achieve the same level of recall as MULDER.
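
A toy illustration of the voting idea (the candidates, ranks, and weighting rule below are invented, not MULDER's actual procedure): answers extracted from several search-engine result lists vote, weighted by snippet rank, and the highest-scoring answer wins:

    # Sketch: voting over candidate answers extracted from several search-engine
    # result lists. Candidates, ranks, and the weighting rule are illustrative.
    from collections import defaultdict

    def vote(candidates_per_query):
        """candidates_per_query: list of lists of (answer, snippet_rank)."""
        scores = defaultdict(float)
        for candidates in candidates_per_query:
            for answer, rank in candidates:
                # repeated answers accumulate; higher-ranked snippets count more
                scores[answer.lower()] += 1.0 / (1 + rank)
        return max(scores.items(), key=lambda kv: kv[1])

    results = [
        [("Alan Shepard", 0), ("John Glenn", 3)],   # query reformulation 1
        [("Alan Shepard", 1)],                      # query reformulation 2
        [("John Glenn", 0), ("Alan Shepard", 2)],   # query reformulation 3
    ]
    print(vote(results))   # -> ('alan shepard', 1.83...)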
Self-Organization and Identification of Web Communities
- IEEE Computer
, 2002
"... Despite the decentralized and unorganized nature of the web, we show that the web self-organizes such that communities of highly related pages can be efficiently identified based purely on connectivity. ..."
Abstract - Cited by 211 (0 self)
Despite the decentralized and unorganized nature of the web, we show that the web self-organizes such that communities of highly related pages can be efficiently identified based purely on connectivity.
Automating the Construction of Internet Portals with Machine Learning
- Information Retrieval
, 2000
"... Domain-specific internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible ..."
Abstract - Cited by 208 (4 self)
Domain-specific Internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible with general, Web-wide search engines. Unfortunately, these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are ...
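
As a small illustration of the topic-hierarchy population step (assuming scikit-learn; the categories and example titles are invented, and this is not the Cora system's classifier), a text classifier can assign paper titles to portal categories:

    # Sketch: assigning paper titles to portal categories with a text classifier
    # (scikit-learn assumed; categories and titles are invented examples).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_titles = [
        "reinforcement learning for robot control",
        "policy gradient methods survey",
        "hidden markov models for information extraction",
        "wrapper induction for web data extraction",
    ]
    train_topics = ["machine_learning", "machine_learning",
                    "information_extraction", "information_extraction"]

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(train_titles, train_topics)
    print(clf.predict(["extraction of fields from research paper headers"]))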
Synchronizing a Database to Improve Freshness
, 1999
"... In this paper we study how to refresh a local copy of an autonomous data source to maintain the copy up-to-date. As the size of the data grows, it becomes more di#cult to maintain the copy "fresh," making it crucial to synchronize the copy e#ectively. We define two freshness metrics, chang ..."
Abstract - Cited by 195 (17 self)
In this paper we study how to refresh a local copy of an autonomous data source to maintain the copy up-to-date. As the size of the data grows, it becomes more difficult to maintain the copy "fresh," making it crucial to synchronize the copy effectively. We define two freshness metrics, change models of the underlying data, and synchronization policies. We analytically study how effective the various policies are. We also experimentally verify our analysis, based on data collected from 270 web sites for more than 4 months, and we show that our new policy improves the "freshness" very significantly compared to current policies in use.
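
For concreteness, a sketch of the kind of quantity such freshness metrics capture, under an assumed Poisson change model with illustrative parameters (the model and numbers are not taken from the paper's data): an element re-synchronized every I days is fresh at a random moment with a probability that has a simple closed form, checked here by simulation:

    # Sketch: expected freshness of one element under an assumed Poisson change
    # model (lam changes per day) when it is re-synchronized every I days. The
    # element is fresh at time t after a sync with probability exp(-lam * t);
    # averaging over a sync period gives the closed form, checked by simulation.
    import math, random

    def expected_freshness(lam, I):
        return (1 - math.exp(-lam * I)) / (lam * I)

    def simulated_freshness(lam, I, samples=200_000):
        fresh = 0
        for _ in range(samples):
            t = random.uniform(0, I)             # random observation time in a period
            if random.expovariate(lam) > t:      # no change since the last sync
                fresh += 1
        return fresh / samples

    lam, I = 0.5, 2.0   # illustrative: ~0.5 changes/day, synced every 2 days
    print(expected_freshness(lam, I))    # ~0.632
    print(simulated_freshness(lam, I))   # should agree closely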
Small-World Phenomena and the Dynamics of Information
- In Advances in Neural Information Processing Systems (NIPS) 14
, 2001
"... Introduction The problem of searching for information in networks like the World Wide Web can be approached in a variety of ways, ranging from centralized indexing schemes to decentralized mechanisms that navigate the underlying network without knowledge of its global structure. The decentralized ap ..."
Abstract - Cited by 177 (5 self)
The problem of searching for information in networks like the World Wide Web can be approached in a variety of ways, ranging from centralized indexing schemes to decentralized mechanisms that navigate the underlying network without knowledge of its global structure. The decentralized approach appears in a variety of settings: in the behavior of users browsing the Web by following hyperlinks; in the design of focused crawlers [4, 5, 8] and other agents that explore the Web's links to gather information; and in the search protocols underlying decentralized peer-to-peer systems such as Gnutella [10], Freenet [7], and recent research prototypes [21, 22, 23], through which users can share resources without a central server. In recent work, we have been investigating the problem of decentralized search in large information networks [14, 15]. Our initial motivation was an experiment that dealt directly with the search problem in a decidedly pre-Internet context: Stanley Milgram
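
A minimal sketch of decentralized search in this spirit (the ring topology, single random shortcut per node, and parameters are illustrative assumptions, not the construction analyzed in the paper): each node forwards the message to whichever of its known contacts looks closest to the target, using purely local information:

    # Sketch: greedy decentralized search on a ring of n nodes, where each node
    # knows its two lattice neighbors plus one random long-range contact. Each
    # step forwards to the known contact closest to the target, using only local
    # information. Topology and parameters are illustrative.
    import random

    def ring_distance(a, b, n):
        d = abs(a - b) % n
        return min(d, n - d)

    def build_contacts(n, seed=0):
        rng = random.Random(seed)
        return {v: {(v - 1) % n, (v + 1) % n, rng.randrange(n)} for v in range(n)}

    def greedy_search(contacts, start, target, n):
        current, steps = start, 0
        while current != target:
            # purely local decision: hop to the contact nearest the target
            current = min(contacts[current], key=lambda u: ring_distance(u, target, n))
            steps += 1
        return steps

    n = 1000
    contacts = build_contacts(n)
    print(greedy_search(contacts, start=0, target=n // 2, n=n))

Whether such greedy forwarding finds short paths depends critically on how the long-range contacts are distributed, which is precisely the question this line of work studies.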
Extrapolation Methods for Accelerating PageRank Computations
- In Proceedings of the Twelfth International World Wide Web Conference
, 2003
"... We present a novel algorithm for the fast computation of PageRank, a hyperlink-based estimate of the "importance" of Web pages. The original PageRank algorithm uses the Power Method to compute successive iterates that converge to the principal eigenvector of the Markov matrix representing ..."
Abstract - Cited by 167 (12 self)
We present a novel algorithm for the fast computation of PageRank, a hyperlink-based estimate of the "importance" of Web pages. The original PageRank algorithm uses the Power Method to compute successive iterates that converge to the principal eigenvector of the Markov matrix representing the Web link graph. The algorithm presented here, called Quadratic Extrapolation, accelerates the convergence of the Power Method by periodically subtracting off estimates of the nonprincipal eigenvectors from the current iterate of the Power Method. In Quadratic Extrapolation, we take advantage of the fact that the first eigenvalue of a Markov matrix is known to be 1 to compute the nonprincipal eigenvectors using successive iterates of the Power Method. Empirically, we show that using Quadratic Extrapolation speeds up PageRank computation by 50-300% on a Web graph of 80 million nodes, with minimal overhead.
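
A compact sketch of the baseline being accelerated, the Power Method iteration for PageRank on a toy graph (the graph and damping factor are illustrative, and the Quadratic Extrapolation step itself, which periodically subtracts estimates of the nonprincipal eigenvectors, is not reproduced here):

    # Sketch: the Power Method iteration for PageRank on a toy graph. This is the
    # baseline that Quadratic Extrapolation accelerates; the extrapolation step
    # itself is not reproduced here.
    import numpy as np

    links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # toy link graph, node -> outlinks
    n, damping = 4, 0.85

    # column-stochastic transition matrix of the random surfer
    M = np.zeros((n, n))
    for src, outs in links.items():
        for dst in outs:
            M[dst, src] = 1.0 / len(outs)

    x = np.full(n, 1.0 / n)
    for iteration in range(100):
        x_next = damping * (M @ x) + (1 - damping) / n
        if np.abs(x_next - x).sum() < 1e-10:      # L1 convergence test
            x = x_next
            break
        x = x_next

    print(np.round(x, 4), "iterations:", iteration + 1)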