Deeper Inside PageRank
- Internet Mathematics, 2004
"... This paper serves as a companion or extension to the “Inside PageRank” paper by Bianchini et al. [Bianchini et al. 03]. It is a comprehensive survey of all issues associated with PageRank, covering the basic PageRank model, available and recommended solution methods, storage issues, existence, uniq ..."
Abstract
-
Cited by 208 (4 self)
This paper serves as a companion or extension to the “Inside PageRank” paper by Bianchini et al. [Bianchini et al. 03]. It is a comprehensive survey of all issues associated with PageRank, covering the basic PageRank model, available and recommended solution methods, storage issues, existence, uniqueness, and convergence properties, possible alterations to the basic model, suggested alternatives to the traditional solution methods, sensitivity and conditioning, and finally the updating problem. We introduce a few new results, provide an extensive reference list, and speculate about exciting areas of future research.
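The survey's baseline solution method is the power method applied to the Google matrix G = alpha*S + (1-alpha)*(1/n)*ee^T. A minimal sketch of that iteration; the link graph and the damping factor alpha = 0.85 here are illustrative, not taken from the paper:

```python
import numpy as np

def pagerank(adj, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power method on the Google matrix G = alpha*S + (1-alpha)*(1/n)*ee^T."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Row-stochastic S: dangling rows (no out-links) become uniform rows.
    S = np.where(out_deg[:, None] > 0,
                 adj / np.maximum(out_deg, 1)[:, None],
                 1.0 / n)
    x = np.full(n, 1.0 / n)                  # uniform starting vector
    for _ in range(max_iter):
        x_new = alpha * (x @ S) + (1 - alpha) / n
        if np.abs(x_new - x).sum() < tol:    # L1 convergence test
            return x_new
        x = x_new
    return x

# Tiny hypothetical 4-page web: 0->1, 0->2, 1->2, 2->0, 3->2
A = np.array([[0, 1, 1, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0]], float)
print(pagerank(A))
```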
Finding advertising keywords on web pages
- In Proceedings of WWW, 2006
"... A large and growing number of web pages display contextual advertising based on keywords automatically extracted from the text of the page, and this is a substantial source of revenue supporting the web today. Despite the importance of this area, little formal, published research exists. We describe ..."
Abstract
-
Cited by 86 (2 self)
A large and growing number of web pages display contextual advertising based on keywords automatically extracted from the text of the page, and this is a substantial source of revenue supporting the web today. Despite the importance of this area, little formal, published research exists. We describe a system that learns how to extract keywords from web pages for advertisement targeting. The system uses a number of features, such as term frequency of each …
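As a rough illustration of the feature-based scoring the abstract describes, the sketch below ranks candidate terms by term frequency plus two plausible auxiliary signals. The extra features and all weights are invented for illustration; the actual system learns its weights:

```python
import re
from collections import Counter

def keyword_candidates(page_title: str, body: str):
    """Score candidate keywords with simple hand-set features:
    term frequency, presence in the title, and capitalization."""
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", body)
    tf = Counter(w.lower() for w in words)
    title_terms = {w.lower() for w in re.findall(r"\w+", page_title)}
    caps = Counter(w.lower() for w in words if w[0].isupper())
    scores = {}
    for term, freq in tf.items():
        score = float(freq)                            # term frequency
        score += 5.0 if term in title_terms else 0.0   # appears in title
        score += 2.0 * caps[term]                      # capitalized in text
        scores[term] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])

top = keyword_candidates("Cheap Flights to Rome",
                         "Book cheap flights today. Rome flights depart daily.")[:5]
print(top)
```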
The connectivity sonar: detecting site functionality by structural patterns
- In Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia, 2003
"... Web sites today serve many different functions, such as corporate sites, search engines, e-stores, and so forth. As sites are created for different purposes, their structure and connectivity characteristics vary. However, this research argues that sites of similar role exhibit similar structural pat ..."
Abstract
-
Cited by 62 (1 self)
Web sites today serve many different functions, such as corporate sites, search engines, e-stores, and so forth. As sites are created for different purposes, their structure and connectivity characteristics vary. However, this research argues that sites of similar role exhibit similar structural patterns, as the functionality of a site naturally induces a typical hyperlinked structure and typical connectivity patterns to and from the rest of the Web. Thus, the functionality of Web sites is reflected in a set of structural and connectivity-based features that form a typical signature. In this paper, we automatically categorize sites into eight distinct functional classes, and highlight several search-engine related applications that could make immediate use of such technology. We purposely limit our categorization algorithms by tapping connectivity and structural data alone, making no use of any content analysis whatsoever. When applying two classification algorithms to a set of 202 sites of the eight defined functional categories, the algorithms correctly classified between 54.5% and 59% of the sites. On some categories, the precision of the classification exceeded 85%. An additional result of this work indicates that the structural signature can be used to detect spam rings and mirror sites, by clustering sites with almost identical signatures.
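A hypothetical sketch of classifying sites by structural signature. The feature dimensions and the nearest-centroid classifier are stand-ins, since the abstract does not reproduce the paper's actual feature set or algorithms:

```python
import numpy as np

# Invented structural signature per site:
# [avg out-links per page, fraction of leaf pages, external in-links, tree depth]
training = {
    "search_engine": np.array([[1.2, 0.9, 9000, 2], [1.5, 0.8, 7000, 3]], float),
    "corporate":     np.array([[8.0, 0.3, 120, 5], [7.2, 0.4, 200, 6]], float),
}

def classify(signature, training):
    """Nearest centroid over z-normalized structural signatures."""
    all_rows = np.vstack(list(training.values()))
    mu, sigma = all_rows.mean(axis=0), all_rows.std(axis=0) + 1e-9
    z = (signature - mu) / sigma
    centroids = {c: ((rows - mu) / sigma).mean(axis=0)
                 for c, rows in training.items()}
    return min(centroids, key=lambda c: np.linalg.norm(z - centroids[c]))

print(classify(np.array([1.3, 0.85, 8000, 2], float), training))
```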
Ranking a stream of news
- In WWW ’05: Proceedings of the 14th International Conference on World Wide Web, 2005
"... According to a recent survey made by Nielsen NetRatings, searching on news articles is one of the most important activity online. Indeed, Google, Yahoo, MSN and many others have proposed commercial search engines for indexing news feeds. Despite this commercial interest, no academic research has foc ..."
Abstract
-
Cited by 39 (1 self)
According to a recent survey by Nielsen NetRatings, searching news articles is one of the most important activities online. Indeed, Google, Yahoo, MSN, and many others have proposed commercial search engines for indexing news feeds. Despite this commercial interest, no academic research has focused on ranking a stream of news articles and a set of news sources. In this paper, we introduce this problem by proposing a ranking framework which models: (1) the process of generation of a stream of news articles, (2) the clustering of news articles by topic, and (3) the evolution of news stories over time. The proposed ranking algorithm ranks news information, finding the most authoritative news sources and identifying the most interesting events in the different categories to which news articles belong. All these ranking measures take time into account and can be obtained without a predefined sliding window of observation over the stream. The complexity of our algorithm is linear in the number of news items still under consideration at the time of a new posting. This allows a continuous on-line ranking process. Our ranking framework is validated on a collection of more than 300,000 news items, produced in two months by more than 2,000 news sources belonging to 13 different categories (World, U.S., Europe, Sports, Business, etc.). This collection was extracted from the index of comeToMyHead, an academic news search engine available online.
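The abstract does not spell out its ranking formulas, but one standard way to score a stream with time taken into account, without a sliding window and with per-posting cost linear in the live items, is exponential decay. A sketch under that assumption; the half-life and the authority-seeding rule are invented:

```python
import math
import time

HALF_LIFE_HOURS = 12.0
LAMBDA = math.log(2) / (HALF_LIFE_HOURS * 3600)

class NewsRanker:
    """Exponentially decayed scores: ranking touches each live item once,
    so the cost is linear in the news still under consideration."""
    def __init__(self):
        self.items = {}   # item_id -> (score, timestamp of last update)

    def _decayed(self, score, ts, now):
        return score * math.exp(-LAMBDA * (now - ts))

    def post(self, item_id, source_authority, now=None):
        now = time.time() if now is None else now
        # Seed a fresh article's score with its source's authority.
        self.items[item_id] = (source_authority, now)

    def rank(self, now=None):
        now = time.time() if now is None else now
        return sorted(self.items,
                      key=lambda i: -self._decayed(*self.items[i], now))

r = NewsRanker()
r.post("a1", 0.9)
r.post("a2", 0.5)
print(r.rank())
```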
A Vector Space Search Engine for Web Services
- In Proceedings of the 3rd European IEEE Conference on Web Services (ECOWS’05), 2005
"... As Web services increasingly become important in distributed computing, some of the flaws and limitations of this technology become more and more obvious. One of this flaws is the discovery of Web services through common methods. Research has been pursued in the field of ”Semantic Web services”. Thi ..."
Abstract
-
Cited by 33 (10 self)
As Web services become increasingly important in distributed computing, some of the flaws and limitations of this technology become more and more obvious. One of these flaws is the discovery of Web services through common methods. Research has been pursued in the field of “Semantic Web services”. This research is driven by the idea of describing the functionality of Web services as accurately as possible and of creating programs automatically out of already existing Web services. In this paper we discuss a new method for the discovery and analysis of Web services. Our approach uses a vector space search engine to index descriptions of already composed services. Rather than generating or automatically composing applications, this approach provides developers with a valuable utility to browse repositories based on already existing information. Furthermore, we propose some additional modifications to extract the maximum amount of semantics from existing service definition repositories.
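A minimal sketch of the vector space indexing step, assuming TF-IDF weighting and cosine similarity, the standard choices for a vector space engine. The service descriptions are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical textual descriptions of already composed services.
descriptions = [
    "convert currency between euro and dollar using daily exchange rates",
    "send sms text message to a phone number",
    "geocode a street address into latitude and longitude",
]

vectorizer = TfidfVectorizer(stop_words="english")
index = vectorizer.fit_transform(descriptions)   # one TF-IDF vector per service

def search(query, k=2):
    """Rank indexed service descriptions by cosine similarity to the query."""
    sims = cosine_similarity(vectorizer.transform([query]), index)[0]
    return sorted(zip(sims, descriptions), reverse=True)[:k]

for score, desc in search("exchange euros to dollars"):
    print(f"{score:.2f}  {desc}")
```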
Revisiting Lexical Signatures to (Re-)Discover Web Pages
- In Proceedings of ECDL ’08, 2008
"... Abstract. A lexical signature (LS) is a small set of terms derived from a document that capture the “aboutness ” of that document. A LS generated from a web page can be used to discover that page at a different URLaswellastofindrelevantpagesintheInternet.Fromasetofrandomly selected URLs we took all ..."
Abstract
-
Cited by 19 (18 self)
A lexical signature (LS) is a small set of terms derived from a document that captures the “aboutness” of that document. An LS generated from a web page can be used to discover that page at a different URL as well as to find relevant pages on the Internet. From a set of randomly selected URLs we took all their copies from the Internet Archive between 1996 and 2007 and generated their LSs. We conducted an overlap analysis of terms in all LSs and found only small overlaps in the early years (1996–2000) but increasing overlap in the more recent past (from 2003 on). We measured the performance of all LSs as a function of the number of terms they consist of. We found that LSs created more recently perform better than early LSs created between 1996 and 2000. All LSs created from the year 2000 on show a similar pattern in their performance curve. Our results show that 5-, 6- and 7-term LSs perform best at returning the URLs of interest in the top ten of the result set. In about 50% of all cases these URLs are returned as the number one result, and in 30% of all cases we considered the URLs not discovered.
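A common way to generate an LS, and presumably close to what is meant here, is to take the top-k TF-IDF terms of the page. A sketch with toy document-frequency statistics standing in for a web-scale collection:

```python
import math
import re
from collections import Counter

def lexical_signature(page_text, doc_freq, n_docs, k=5):
    """Top-k TF-IDF terms of a page (the paper finds 5- to 7-term LSs best)."""
    terms = re.findall(r"[a-z]+", page_text.lower())
    tf = Counter(terms)
    def tfidf(t):
        return tf[t] * math.log(n_docs / (1 + doc_freq.get(t, 0)))
    return sorted(tf, key=tfidf, reverse=True)[:k]

# Toy background statistics: document frequency per term.
df = {"the": 900, "in": 950, "library": 40, "digital": 60, "preservation": 8}
print(lexical_signature("digital preservation in the digital library",
                        df, n_docs=1000, k=3))
```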
Implicit Queries for Email
"... Implicit query systems examine a document and automatically conduct searches for the most relevant information. In this paper, we offer three contributions to implicit query research. First, we show how to use query logs from a search engine: by constraining results to commonly issued queries, we ..."
Abstract
-
Cited by 14 (0 self)
Implicit query systems examine a document and automatically conduct searches for the most relevant information. In this paper, we offer three contributions to implicit query research. First, we show how to use query logs from a search engine: by constraining results to commonly issued queries, we can get dramatic improvements. Second, we describe a method for optimizing parameters for an implicit query system, by using logistic regression training. The method is designed to estimate the probability that any particular suggested query is a good one. Third, we show which features beyond standard TF-IDF features are most helpful in our logistic regression model: query frequency information, capitalization information, subject line information, and message length information. Using the optimization method and the additional features, we are able to produce a system with up to 6 times better results on top-1 score than a simple TF-IDF system.
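A sketch of the probability-estimation step: logistic regression over per-candidate features, mirroring the feature families the abstract names (query-log frequency, capitalization, subject line, message length). The feature values and labels below are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature rows for candidate queries extracted from an email:
# [tf-idf score, log query-log freq, capitalized, in subject, msg length / 1000]
X = np.array([
    [3.1, 7.2, 1, 1, 0.4],   # e.g. "project falcon deadline" -> good suggestion
    [1.0, 0.0, 0, 0, 0.4],   # e.g. "regards"                 -> bad suggestion
    [2.2, 5.5, 1, 0, 1.2],
    [0.5, 1.1, 0, 0, 1.2],
])
y = np.array([1, 0, 1, 0])   # label: was the suggested query good?

model = LogisticRegression().fit(X, y)

# Estimated probability that a new candidate query is a good one,
# matching the abstract's probabilistic framing.
candidate = np.array([[2.8, 6.0, 1, 1, 0.6]])
print(model.predict_proba(candidate)[0, 1])
```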
Linking archives using document enrichment and term selection
- In Proceedings of TPDL 2011
"... ..."
(Show Context)
Unweaving a Web of Documents
"... We develop an algorithmic framework to decompose a collection of time-stamped text documents into semantically coherent threads. Our formulation leads to a graph decomposition problem on directed acyclic graphs, for which we obtain three algorithms — an exact algorithm that is based on minimum cost ..."
Abstract
-
Cited by 10 (0 self)
We develop an algorithmic framework to decompose a collection of time-stamped text documents into semantically coherent threads. Our formulation leads to a graph decomposition problem on directed acyclic graphs, for which we obtain three algorithms — an exact algorithm that is based on minimum cost flow and two more efficient algorithms based on maximum matching and dynamic programming that solve specific versions of the graph decomposition problem. Applications of our algorithms include superior summarization of news search results, improved browsing paradigms for large collections of text-intensive corpora, and integration of time-stamped documents from a variety of sources. Experimental results based on over 250,000 news articles from a major newspaper over a period of four years demonstrate that our algorithms efficiently identify robust threads of varying lengths and time-spans.
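The matching-based algorithm the abstract mentions suggests the classic reduction from decomposing a DAG into vertex-disjoint paths ("threads") to maximum bipartite matching; a sketch of that reduction, which may differ from the paper's exact construction:

```python
import networkx as nx

def thread_decomposition(n, edges):
    """Decompose a DAG on nodes 0..n-1 into a minimum number of
    vertex-disjoint paths: split each node into an out-copy and an
    in-copy, take a maximum bipartite matching, then the number of
    paths equals n - |matching|."""
    B = nx.Graph()
    left = [("out", u) for u in range(n)]
    B.add_nodes_from(left, bipartite=0)
    B.add_nodes_from([("in", v) for v in range(n)], bipartite=1)
    B.add_edges_from((("out", u), ("in", v)) for u, v in edges)
    match = nx.bipartite.maximum_matching(B, top_nodes=left)
    # Follow matched edges to read off each thread.
    nxt = {u: v for (side, u), (_, v) in match.items() if side == "out"}
    starts = set(range(n)) - set(nxt.values())
    threads = []
    for s in starts:
        path = [s]
        while path[-1] in nxt:
            path.append(nxt[path[-1]])
        threads.append(path)
    return threads

# Toy time-ordered similarity edges between documents 0..4.
print(thread_decomposition(5, [(0, 1), (1, 2), (0, 3), (3, 4)]))
```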
Feeding the Second Screen: Semantic Linking based on Subtitles
"... Television broadcasts are increasingly consumed on an interactive device or with such a device in the vicinity. Around 70 % of tablet and smartphone owners use their devices while watching television [11]. This allows broadcasters to provide consumers with additional background information that they ..."
Abstract
-
Cited by 9 (3 self)
Television broadcasts are increasingly consumed on an interactive device or with such a device in the vicinity. Around 70% of tablet and smartphone owners use their devices while watching television [11]. This allows broadcasters to provide consumers with additional background information that they may bookmark for later consumption in applications such as depicted in Figure 1. For live television, edited broadcast-specific content to be used on second screens is hard to prepare in advance. We present an approach for automatically generating links to background information in real-time, to be used on second screens. We base our semantic linking approach for television broadcasts on subtitles and Wikipedia, thereby effectively casting the task as one of identifying and generating links for elements in the stream of subtitles. The process of automatically generating links to Wikipedia is …
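A minimal sketch of the lexical lookup step for linking subtitle text to Wikipedia, using a hypothetical anchor-text dictionary; the actual system's retrieval and learning components are not reproduced here:

```python
import re

# Hypothetical anchor-text dictionary: surface form -> Wikipedia title.
# A real system would mine this from Wikipedia's internal link graph.
ANCHORS = {
    "eiffel tower": "Eiffel_Tower",
    "tour de france": "Tour_de_France",
    "paris": "Paris",
}
MAX_NGRAM = 3

def link_subtitle(line):
    """Greedy longest-match lookup of word n-grams in the anchor dictionary."""
    words = re.findall(r"[a-z']+", line.lower())
    links, i = [], 0
    while i < len(words):
        for n in range(min(MAX_NGRAM, len(words) - i), 0, -1):
            gram = " ".join(words[i:i + n])
            if gram in ANCHORS:
                links.append((gram, "https://en.wikipedia.org/wiki/" + ANCHORS[gram]))
                i += n
                break
        else:
            i += 1
    return links

print(link_subtitle("Tonight we visit the Eiffel Tower in Paris."))
```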