Results 1 - 10 of 241
Structure and evolution of online social networks
In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
"... In this paper, we consider the evolution of structure within large online social networks. We present a series of measurements of two such networks, together comprising in excess of five million people and ten million friendship links, annotated with metadata capturing the time of every event in the ..."
Abstract
-
Cited by 400 (4 self)
In this paper, we consider the evolution of structure within large online social networks. We present a series of measurements of two such networks, together comprising in excess of five million people and ten million friendship links, annotated with metadata capturing the time of every event in the life of the network. Our measurements expose a surprising segmentation of these networks into three regions: singletons who do not participate in the network; isolated communities which overwhelmingly display star structure; and a giant component anchored by a well-connected core region which persists even in the absence of stars. We present a simple model of network growth which captures these aspects of component structure. The model follows our experimental results, characterizing users as passive members of the network; inviters who encourage offline friends and acquaintances to migrate online; or linkers who fully participate in the social evolution of the network.
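The three-role model described above lends itself to a small simulation. The sketch below is a hypothetical rendering of the abstract's description, not the paper's fitted model: the role mix, invitation counts, and linker edge counts are all assumed values.

```python
import random
from collections import defaultdict

ROLES = ("passive", "inviter", "linker")
ROLE_WEIGHTS = (0.5, 0.3, 0.2)  # assumed mix, not the paper's fitted values

def simulate(arrivals, seed=0):
    rng = random.Random(seed)
    roles = {}              # node id -> role
    adj = defaultdict(set)  # undirected friendship links
    ids = iter(range(10 * arrivals))

    def new_node():
        nid = next(ids)
        roles[nid] = rng.choices(ROLES, weights=ROLE_WEIGHTS)[0]
        return nid

    for _ in range(arrivals):
        u = new_node()
        if roles[u] == "inviter":
            # Inviters pull offline friends online, yielding star communities.
            for _ in range(rng.randint(1, 4)):
                v = new_node()
                adj[u].add(v)
                adj[v].add(u)
        elif roles[u] == "linker":
            # Linkers befriend existing non-passive users, growing the core.
            pool = [n for n, r in roles.items() if r != "passive" and n != u]
            for v in rng.sample(pool, k=min(3, len(pool))):
                adj[u].add(v)
                adj[v].add(u)
        # Passive members join but create no links, remaining singletons.
    return roles, adj

roles, adj = simulate(1000)
print(sum(1 for n in roles if not adj[n]), "singletons of", len(roles), "nodes")
```

Even this toy reproduces the reported segmentation: passive singletons, inviter-anchored stars, and a linker core that merges into a giant component.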
What's New on the Web? The Evolution of the Web from a Search Engine Perspective
2004.
"... We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of co ..."
Abstract
-
Cited by 220 (15 self)
We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate of creation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change.
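As a concrete (assumed) illustration of quantifying how much a page changed between weekly snapshots, a word-shingle overlap measure such as the following is commonly used; the paper defines its own search-centric measures, so treat this only as a sketch of the general idea.

```python
def shingles(text, w=5):
    """Set of w-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

def degree_of_change(old_text, new_text, w=5):
    """Jaccard distance of shingle sets: 0.0 = unchanged, 1.0 = all new."""
    a, b = shingles(old_text, w), shingles(new_text, w)
    return 1.0 - len(a & b) / len(a | b) if (a or b) else 0.0

print(degree_of_change("the quick brown fox jumps over the lazy dog",
                       "the quick brown fox leaps over the lazy dog", w=3))
```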
Deeper Inside PageRank
Internet Mathematics, 2004.
"... This paper serves as a companion or extension to the “Inside PageRank” paper by Bianchini et al. [Bianchini et al. 03]. It is a comprehensive survey of all issues associated with PageRank, covering the basic PageRank model, available and recommended solution methods, storage issues, existence, uniq ..."
Abstract
-
Cited by 208 (4 self)
This paper serves as a companion or extension to the “Inside PageRank” paper by Bianchini et al. [Bianchini et al. 03]. It is a comprehensive survey of all issues associated with PageRank, covering the basic PageRank model, available and recommended solution methods, storage issues, existence, uniqueness, and convergence properties, possible alterations to the basic model, suggested alternatives to the traditional solution methods, sensitivity and conditioning, and finally the updating problem. We introduce a few new results, provide an extensive reference list, and speculate about exciting areas of future research.
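For reference, the basic PageRank model the survey builds on can be computed by plain power iteration. The sketch below uses the standard formulation with damping factor alpha = 0.85 and uniform teleportation; the example graph, tolerance, and dangling-node handling are illustrative choices.

```python
def pagerank(out_links, alpha=0.85, tol=1e-10, max_iter=100):
    """Power iteration for the basic PageRank model.
    out_links maps each node to a list of nodes it links to; dangling
    nodes (no out-links) redistribute their mass uniformly."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        dangling = sum(rank[u] for u in nodes if not out_links.get(u))
        new = {u: (1.0 - alpha) / n + alpha * dangling / n for u in nodes}
        for u, vs in out_links.items():
            for v in vs:
                new[v] += alpha * rank[u] / len(vs)
        if sum(abs(new[u] - rank[u]) for u in nodes) < tol:
            return new
        rank = new
    return rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```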
Microscopic Evolution of Social Networks
2008.
"... We present a detailed study of network evolution by analyzing four large online social networks with full temporal information about node and edge arrivals. For the first time at such a large scale, we study individual node arrival and edge creation processes that collectively lead to macroscopic pr ..."
Abstract
-
Cited by 206 (10 self)
We present a detailed study of network evolution by analyzing four large online social networks with full temporal information about node and edge arrivals. For the first time at such a large scale, we study individual node arrival and edge creation processes that collectively lead to macroscopic properties of networks. Using a methodology based on the maximum-likelihood principle, we investigate a wide variety of network formation strategies, and show that edge locality plays a critical role in the evolution of networks. Our findings supplement earlier network models based on inherently non-local preferential attachment. Based on our observations, we develop a complete model of network evolution, where nodes arrive at a prespecified rate and select their lifetimes. Each node then independently initiates edges according to a “gap” process, selecting a destination for each edge according to a simple triangle-closing model free of any parameters. We show analytically that the combination of the gap distribution with the node lifetime leads to a power law out-degree distribution that accurately reflects the true network in all four cases. Finally, we give model parameter settings that allow automatic evolution and generation of realistic synthetic networks of arbitrary scale.
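The arrival/lifetime/gap/triangle-closing process described in the abstract can be rendered as a toy generator. In the sketch below the lifetime and gap distributions are stand-ins (the paper fits these to data), and the first edge of each node attaches uniformly at random for simplicity.

```python
import random

def triangle_closing_network(n_nodes, seed=0):
    """Toy generator: each arriving node samples a lifetime, then
    repeatedly waits a sampled gap and adds an edge that closes a
    triangle (a random neighbor of a random neighbor). Distributions
    are illustrative stand-ins for the paper's fitted ones."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}            # seed edge so triangles can form
    for u in range(2, n_nodes):
        adj[u] = set()
        first = rng.randrange(u)      # first edge: attach anywhere
        adj[u].add(first); adj[first].add(u)
        lifetime = rng.expovariate(1.0)
        t = rng.expovariate(2.0)      # time of next edge initiation
        while t < lifetime:
            w = rng.choice(sorted(adj[u]))          # random neighbor
            candidates = sorted(adj[w] - adj[u] - {u})
            if candidates:                          # close a triangle
                v = rng.choice(candidates)
                adj[u].add(v); adj[v].add(u)
            t += rng.expovariate(2.0)
    return adj

net = triangle_closing_network(500)
print(sum(len(nbrs) for nbrs in net.values()) // 2, "edges")
```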
Spamming botnets: signatures and characteristics
In SIGCOMM, 2008.
"... In this paper, we focus on characterizing spamming botnets by leveraging both spam payload and spam server traffic properties. Towards this goal, we developed a spam signature generation framework called AutoRE to detect botnet-based spam emails and botnet membership. AutoRE does not require pre-cla ..."
Abstract
-
Cited by 120 (14 self)
In this paper, we focus on characterizing spamming botnets by leveraging both spam payload and spam server traffic properties. Towards this goal, we developed a spam signature generation framework called AutoRE to detect botnet-based spam emails and botnet membership. AutoRE does not require pre-classified training data or white lists. Moreover, it outputs high-quality regular expression signatures that can detect botnet spam with a low false positive rate. Using a three-month sample of emails from Hotmail, AutoRE successfully identified 7,721 botnet-based spam campaigns together with 340,050 unique botnet host IP addresses. Our in-depth analysis of the identified botnets revealed several interesting findings regarding the degree of email obfuscation, properties of botnet IP addresses, sending patterns, and their correlation with network scanning traffic. We believe these observations are useful in the design of botnet detection schemes.
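AutoRE's actual signature generation is considerably more sophisticated; purely as an illustrative toy (the grouping rule, threshold, and all names here are assumptions, not the paper's method), one can cluster spam URLs by host and emit a regular expression from each cluster's common prefix:

```python
import re
from collections import defaultdict
from os.path import commonprefix
from urllib.parse import urlparse

def toy_url_signatures(spam_urls, min_group=3):
    """Illustrative toy only -- NOT the AutoRE algorithm. Groups spam
    URLs by host and derives a regex from each group's common prefix."""
    groups = defaultdict(list)
    for url in spam_urls:
        groups[urlparse(url).netloc].append(url)
    signatures = {}
    for host, urls in groups.items():
        if len(urls) >= min_group:          # assumed campaign-size cutoff
            signatures[host] = re.compile(re.escape(commonprefix(urls)) + r".*")
    return signatures

urls = ["http://spam.example/buy?id=1", "http://spam.example/buy?id=7",
        "http://spam.example/buy?id=42", "http://other.example/x"]
for host, sig in toy_url_signatures(urls).items():
    print(host, "->", sig.pattern)
```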
Impact of Search Engines on Page Popularity
2004.
"... Recent studies show that a majority of Web page accesses are referred by search engines. In this paper we study the widespread use of Web search engines and its impact on the ecology of the Web. In particular, we study how much impact search engines have on the popularity evolution of Web pages. For ..."
Abstract
-
Cited by 99 (1 self)
Recent studies show that a majority of Web page accesses are referred by search engines. In this paper we study the widespread use of Web search engines and its impact on the ecology of the Web. In particular, we study how much impact search engines have on the popularity evolution of Web pages. For example, given that search engines return currently "popular" pages at the top of search results, are we somehow penalizing newly created pages that are not very well known yet? Are popular pages getting even more popular and new pages completely ignored? We first show that this unfortunate trend indeed exists on the Web through an experimental study based on real Web data. We then analytically estimate how much longer it takes for a new page to attract a large number of Web users when search engines return only popular pages at the top of search results. Our result shows that search engines can have an immensely worrisome impact on the discovery of new Web pages.
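The rich-get-richer dynamic analyzed here can be illustrated with a toy simulation (all rates below are assumptions): when visits are routed in proportion to current popularity, as a popularity-ranked search engine effectively does, traffic concentrates on a few pages and late arrivals languish.

```python
import random

def simulate_visits(n_pages, n_visits, bias, seed=0):
    """Toy model: with probability `bias` a visit goes to a page in
    proportion to its current popularity (search-engine effect);
    otherwise the page is discovered uniformly at random."""
    rng = random.Random(seed)
    visits = [1] * n_pages  # smoothing so every page is reachable
    for _ in range(n_visits):
        if rng.random() < bias:
            page = rng.choices(range(n_pages), weights=visits)[0]
        else:
            page = rng.randrange(n_pages)
        visits[page] += 1
    return visits

for bias in (0.0, 0.9):
    counts = sorted(simulate_visits(100, 20000, bias), reverse=True)
    print(f"bias={bias}: top page {counts[0]} visits, median {counts[50]}")
```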
SpamRank -- Fully Automatic Link Spam Detection
In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
"... Spammers intend to increase the PageRank of certain spam pages by creating a large number of links pointing to them. We propose a novel method based on the concept of personalized PageRank that detects pages with an undeserved high PageRank value without the need of any kind of white or blacklists ..."
Abstract
-
Cited by 96 (5 self)
Spammers intend to increase the PageRank of certain spam pages by creating a large number of links pointing to them. We propose a novel method based on the concept of personalized PageRank that detects pages with an undeservedly high PageRank value without the need for whitelists, blacklists, or other human intervention. We assume that spammed pages have a biased distribution of pages that contribute to their undeserved high PageRank value. We define SpamRank by penalizing pages that originate a suspicious PageRank share and personalizing PageRank on the penalties. We tested our method on a 31 million page crawl of the .de domain, with a manually classified 1000-page stratified random sample biased towards large PageRank values.
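The core mechanism, personalizing PageRank on the penalties, amounts to replacing the uniform teleportation vector with the normalized penalty distribution. A minimal sketch follows; the example graph and penalty values are assumptions, and dangling-node mass is simply dropped for brevity.

```python
def penalized_pagerank(out_links, penalties, alpha=0.85, iters=100):
    """Personalized PageRank whose teleport vector is the normalized
    penalty distribution, in the spirit of SpamRank. High scores mark
    pages supported by suspicious PageRank contributions."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    total = sum(penalties.get(u, 0.0) for u in nodes) or 1.0
    teleport = {u: penalties.get(u, 0.0) / total for u in nodes}
    rank = dict(teleport)
    for _ in range(iters):
        new = {u: (1.0 - alpha) * teleport[u] for u in nodes}
        for u, vs in out_links.items():
            for v in vs:
                new[v] += alpha * rank[u] / len(vs)
        rank = new  # dangling mass is dropped for brevity
    return rank

graph = {"spam": ["target"], "farm": ["target"], "honest": ["target"]}
print(penalized_pagerank(graph, penalties={"spam": 1.0, "farm": 1.0}))
```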
On the Evolution of Clusters of Near-Duplicate Web Pages
In 1st Latin American Web Congress, 2003.
"... This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clust ..."
Abstract
-
Cited by 63 (3 self)
This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: Two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: Web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the other versions, thereby freeing up crawling resources that can be brought to bear more productively somewhere else.
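At the scale of 150 million pages, near-duplicate detection is typically done with shingle sketches (MinHash) rather than exact set comparison; the hash count, shingle width, and example texts in the sketch below are assumed values.

```python
import hashlib

def minhash_signature(text, w=5, num_hashes=64):
    """MinHash sketch over w-word shingles; agreement between two
    signatures estimates the Jaccard similarity of the shingle sets."""
    words = text.split()
    shingle_set = {" ".join(words[i:i + w])
                   for i in range(max(1, len(words) - w + 1))}
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(),
                                           digest_size=8).digest(), "big")
            for s in shingle_set))
    return sig

def estimated_similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("page body text " * 40 + "with a small edit")
b = minhash_signature("page body text " * 40 + "with another edit")
print(estimated_similarity(a, b))  # close to 1.0 -> near-duplicates
```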
Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay
In Proceedings of the 13th Conference on World Wide Web, 2004.
"... The rapid growth of the web has been noted and tracked extensively. Recent studies have however documented the dual phenomenon: web pages have small half lives, and thus the web exhibits rapid death as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links ..."
Abstract
-
Cited by 63 (0 self)
The rapid growth of the web has been noted and tracked extensively. Recent studies, however, have documented the dual phenomenon: web pages have small half-lives, and thus the web exhibits rapid death as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links up-to-date, and many are falling behind. Beyond individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as information resources. Such neighborhoods are identified only by frustrated searchers, seeking a way out of these stale neighborhoods, back to more up-to-date sections of the web; measuring the decay of a page purely on the basis of dead links on the page is too naive to reflect this frustration. In this paper we formalize a strong notion of a decay measure and present algorithms for computing it efficiently. We explore this measure by presenting a number of validations, and use it to identify interesting artifacts on today's web. We then describe a number of applications of such a measure to search engines, web page maintainers, ontologists, and individual users.
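The abstract's random-surfer intuition suggests one plausible formalization, sketched below; this is an assumption in the spirit of the abstract, not necessarily the paper's exact measure. Dead pages score 1, and a live page's decay is the damped average decay of its out-neighbors, computed as a fixed point.

```python
def decay_scores(out_links, dead, damping=0.9, iters=50):
    """Fixed-point sketch of a decay measure (assumed formalization):
    dead pages score 1; a live page's decay is the damped mean decay
    of the pages it links to."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    score = {u: 1.0 if u in dead else 0.0 for u in nodes}
    for _ in range(iters):
        for u in nodes:
            if u in dead:
                continue
            outs = out_links.get(u, [])
            # A live page with no out-links offers no way out either.
            score[u] = (damping * sum(score[v] for v in outs) / len(outs)
                        if outs else 1.0)
    return score

graph = {"fresh": ["hub"], "hub": ["gone1", "gone2", "live"], "live": ["fresh"]}
print(decay_scores(graph, dead={"gone1", "gone2"}))
```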
The connectivity sonar: detecting site functionality by structural patterns
In Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia, 2003.
"... Web sites today serve many different functions, such as corporate sites, search engines, e-stores, and so forth. As sites are created for different purposes, their structure and connectivity characteristics vary. However, this research argues that sites of similar role exhibit similar structural pat ..."
Abstract
-
Cited by 62 (1 self)
Web sites today serve many different functions, such as corporate sites, search engines, e-stores, and so forth. As sites are created for different purposes, their structure and connectivity characteristics vary. However, this research argues that sites with similar roles exhibit similar structural patterns, as the functionality of a site naturally induces a typical hyperlinked structure and typical connectivity patterns to and from the rest of the Web. Thus, the functionality of Web sites is reflected in a set of structural and connectivity-based features that form a typical signature. In this paper, we automatically categorize sites into eight distinct functional classes, and highlight several search-engine-related applications that could make immediate use of such technology. We purposely limit our categorization algorithms by tapping connectivity and structural data alone, making no use of any content analysis whatsoever. When applying two classification algorithms to a set of 202 sites of the eight defined functional categories, the algorithms correctly classified between 54.5% and 59% of the sites. On some categories, the precision of the classification exceeded 85%. An additional result of this work indicates that the structural signature can be used to detect spam rings and mirror sites, by clustering sites with almost identical signatures.
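A minimal rendering of the signature idea, with an assumed feature set and classifier (the paper defines its own features and evaluates different algorithms): represent each site as a vector of connectivity statistics and classify by nearest neighbor.

```python
import math

# Hypothetical structural signatures: (mean internal out-degree,
# fraction of inbound links arriving from outside the site, mean page
# depth). Feature choices and values are illustrative assumptions.
TRAIN = [
    ((2.1, 0.8, 1.2), "search engine"),
    ((9.5, 0.2, 4.0), "corporate site"),
    ((6.0, 0.5, 3.1), "e-store"),
    ((2.3, 0.7, 1.4), "search engine"),
]

def classify(signature):
    """1-nearest-neighbor over structural signatures."""
    return min(TRAIN, key=lambda ex: math.dist(ex[0], signature))[1]

print(classify((2.0, 0.75, 1.3)))  # -> "search engine"
```

The same vectors support the abstract's final observation: sites whose signatures are almost identical cluster together, which is how spam rings and mirrors surface.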