Results 1  10
of
256
Statistical properties of community structure in large social and information networks
"... A large body of work has been devoted to identifying community structure in networks. A community is often though of as a set of nodes that has more connections between its members than to the remainder of the network. In this paper, we characterize as a function of size the statistical and structur ..."
Abstract

Cited by 242 (14 self)
 Add to MetaCart
(Show Context)
A large body of work has been devoted to identifying community structure in networks. A community is often though of as a set of nodes that has more connections between its members than to the remainder of the network. In this paper, we characterize as a function of size the statistical and structural properties of such sets of nodes. We define the network community profile plot, which characterizes the “best ” possible community—according to the conductance measure—over a wide range of size scales, and we study over 70 large sparse realworld networks taken from a wide range of application domains. Our results suggest a significantly more refined picture of community structure in large realworld networks than has been appreciated previously. Our most striking finding is that in nearly every network dataset we examined, we observe tight but almost trivial communities at very small scales, and at larger size scales, the best possible communities gradually “blend in ” with the rest of the network and thus become less “communitylike.” This behavior is not explained, even at a qualitative level, by any of the commonlyused network generation models. Moreover, this behavior is exactly the opposite of what one would expect based on experience with and intuition from expander graphs, from graphs that are wellembeddable in a lowdimensional structure, and from small social networks that have served as testbeds of community detection algorithms. We have found, however, that a generative model, in which new edges are added via an iterative “forest fire” burning process, is able to produce graphs exhibiting a network community structure similar to our observations.
Microscopic Evolution of Social Networks
, 2008
"... We present a detailed study of network evolution by analyzing four large online social networks with full temporal information about node and edge arrivals. For the first time at such a large scale, we study individual node arrival and edge creation processes that collectively lead to macroscopic pr ..."
Abstract

Cited by 202 (9 self)
 Add to MetaCart
We present a detailed study of network evolution by analyzing four large online social networks with full temporal information about node and edge arrivals. For the first time at such a large scale, we study individual node arrival and edge creation processes that collectively lead to macroscopic properties of networks. Using a methodology based on the maximumlikelihood principle, we investigate a wide variety of network formation strategies, and show that edge locality plays a critical role in evolution of networks. Our findings supplement earlier network models based on the inherently nonlocal preferential attachment. Based on our observations, we develop a complete model of network evolution, where nodes arrive at a prespecified rate and select their lifetimes. Each node then independently initiates edges according to a “gap” process, selecting a destination for each edge according to a simple triangleclosing model free of any parameters. We show analytically that the combination of the gap distribution with the node lifetime leads to a power law outdegree distribution that accurately reflects the true network in all four cases. Finally, we give model parameter settings that allow automatic evolution and generation of realistic synthetic networks of arbitrary scale.
Community structure in large networks: Natural cluster sizes and the absence of large welldefined clusters
, 2008
"... A large body of work has been devoted to defining and identifying clusters or communities in social and information networks, i.e., in graphs in which the nodes represent underlying social entities and the edges represent some sort of interaction between pairs of nodes. Most such research begins wit ..."
Abstract

Cited by 198 (17 self)
 Add to MetaCart
(Show Context)
A large body of work has been devoted to defining and identifying clusters or communities in social and information networks, i.e., in graphs in which the nodes represent underlying social entities and the edges represent some sort of interaction between pairs of nodes. Most such research begins with the premise that a community or a cluster should be thought of as a set of nodes that has more and/or better connections between its members than to the remainder of the network. In this paper, we explore from a novel perspective several questions related to identifying meaningful communities in large social and information networks, and we come to several striking conclusions. Rather than defining a procedure to extract sets of nodes from a graph and then attempt to interpret these sets as a “real ” communities, we employ approximation algorithms for the graph partitioning problem to characterize as a function of size the statistical and structural properties of partitions of graphs that could plausibly be interpreted as communities. In particular, we define the network community profile plot, which characterizes the “best ” possible community—according to the conductance measure—over a wide range of size scales. We study over 100 large realworld networks, ranging from traditional and online social networks, to technological and information networks and
Kronecker Graphs: An Approach to Modeling Networks
 JOURNAL OF MACHINE LEARNING RESEARCH 11 (2010) 9851042
, 2010
"... How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in and outdegree distribution, heavy tails for the ei ..."
Abstract

Cited by 122 (3 self)
 Add to MetaCart
How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in and outdegree distribution, heavy tails for the eigenvalues and eigenvectors, small diameters, and densification and shrinking diameters over time. Current network models and generators either fail to match several of the above properties, are complicated to analyze mathematically, or both. Here we propose a generative model for networks that is both mathematically tractable and can generate networks that have all the above mentioned structural properties. Our main idea here is to use a nonstandard matrix operation, the Kronecker product, to generate graphs which we refer to as “Kronecker graphs”. First, we show that Kronecker graphs naturally obey common network properties. In fact, we rigorously prove that they do so. We also provide empirical evidence showing that Kronecker graphs can effectively model the structure of real networks. We then present KRONFIT, a fast and scalable algorithm for fitting the Kronecker graph generation model to large real networks. A naive approach to fitting would take superexponential
PowerGraph: Distributed GraphParallel Computation on Natural Graphs
"... Largescale graphstructured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graphparallel abstractions including Pregel and GraphLab. However, the natural graphs commonly found in the realworld have highly ..."
Abstract

Cited by 117 (4 self)
 Add to MetaCart
(Show Context)
Largescale graphstructured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graphparallel abstractions including Pregel and GraphLab. However, the natural graphs commonly found in the realworld have highly skewed powerlaw degree distributions, which challenge the assumptions made by these abstractions, limiting performance and scalability. In this paper, we characterize the challenges of computation on natural graphs in the context of existing graphparallel abstractions. We then introduce the PowerGraph abstraction which exploits the internal structure of graph programs to address these challenges. Leveraging the PowerGraph abstraction we introduce a new approach to distributed graph placement and representation that exploits the structure of powerlaw graphs. We provide a detailed analysis and experimental evaluation comparing PowerGraph to two popular graphparallel systems. Finally, we describe three different implementation strategies for PowerGraph and discuss their relative merits with empirical evaluations on largescale realworld problems demonstrating order of magnitude gains. 1
Inferring Networks of Diffusion and Influence
, 2010
"... Information diffusion and virus propagation are fundamental processes talking place in networks. While it is often possible to directly observe when nodes become infected, observing individual transmissions (i.e., who infects whom or who influences whom) is typically very difficult. Furthermore, in ..."
Abstract

Cited by 112 (13 self)
 Add to MetaCart
Information diffusion and virus propagation are fundamental processes talking place in networks. While it is often possible to directly observe when nodes become infected, observing individual transmissions (i.e., who infects whom or who influences whom) is typically very difficult. Furthermore, in many applications, the underlying network over which the diffusions and propagations spread is actually unobserved. We tackle these challenges by developing a method for tracing paths of diffusion and influence through networks and inferring the networks over which contagions propagate. Given the times when nodes adopt pieces of information or become infected, we identify the optimal network that best explains the observed infection times. Since the optimization problem is NPhard to solve exactly, we develop an efficient approximation algorithm that scales to large datasets and in practice gives provably nearoptimal performance. We demonstrate the effectiveness of our approach by tracing information cascades in a set of 170 million blogs and news articles over a one year period to infer how information flows through the online media space. We find that the diffusion network of news tends to have a coreperiphery structure with a small set of core media sites that diffuse information to the rest of the Web. These sites tend to have stable circles of influence with more general news media sites acting as connectors between them.
No Free Lunch in Data Privacy
"... Differential privacy is a powerful tool for providing privacypreserving noisy query answers over statistical databases. It guarantees that the distribution of noisy query answers changes very little with the addition or deletion of any tuple. It is frequently accompanied by popularized claims that i ..."
Abstract

Cited by 73 (5 self)
 Add to MetaCart
(Show Context)
Differential privacy is a powerful tool for providing privacypreserving noisy query answers over statistical databases. It guarantees that the distribution of noisy query answers changes very little with the addition or deletion of any tuple. It is frequently accompanied by popularized claims that it provides privacy without any assumptions about the data and that it protects against attackers who know all but one record. In this paper we critically analyze the privacy protections offered by differential privacy. First, we use a nofreelunch theorem, which defines nonprivacy as a game, to argue that it is not possible to provide privacy and utility without making assumptions about how the data are generated. Then we explain where assumptions are needed. We argue that privacy of an individual is preserved when it is possible to limit the inference of an attacker about the participation of the individual in the data generating process. This is different from limiting the inference about the presence of a tuple (for example, Bob’s participation in a social network may cause edges to form between pairs of his friends, so that it affects more than just the tuple labeled as “Bob”). The definition of evidence of participation, in turn, depends on how the data are generated – this is how assumptions enter the picture. We explain these ideas using examples from social network research as well as tabular data for which deterministic statistics have been previously released. In both cases the notion of participation varies, the use of differential privacy can lead to privacy breaches, and differential privacy does not always adequately limit inference about participation.
The little engine(s) that could: scaling online social networks
 in ACM SIGCOMM Conference, 2010
"... The difficulty of scaling Online Social Networks (OSNs) has introduced new system design challenges that has often caused costly rearchitecting for services like Twitter and Facebook. The complexity of interconnection of users in social networks has introduced new scalability challenges. Convention ..."
Abstract

Cited by 67 (5 self)
 Add to MetaCart
(Show Context)
The difficulty of scaling Online Social Networks (OSNs) has introduced new system design challenges that has often caused costly rearchitecting for services like Twitter and Facebook. The complexity of interconnection of users in social networks has introduced new scalability challenges. Conventional vertical scaling by resorting to full replication can be a costly proposition. Horizontal scaling by partitioning and distributing data among multiples servers – e.g. using DHTs – can lead to costly interserver communication. We design, implement, and evaluate SPAR, a social partitioning and replication middleware that transparently leverages the social graph structure to achieve data locality while minimizing replication. SPAR guarantees that for all users in an OSN, their direct neighbor’s data is colocated in the same server. The gains from this approach are multifold: application developers can assume local semantics, i.e., develop as they would for a single server; scalability is achieved by adding commodity servers with low memory and network I/O requirements; and redundancy is achieved at a fraction of the cost. We detail our system design and an evaluation based on datasets from Twitter, Orkut, and Facebook, with a working implementation. We show that SPAR incurs minimum overhead, and can help a wellknown opensource Twitter clone reach Twitter’s scale without changing a line of its application logic and achieves higher throughput than Cassandra, Facebook’s DHT based keyvalue store database.
Overlapping community detection in networks: the state of the art and comparative study
 ACM Comput. Surv
, 2012
"... This paper reviews the state of the art in overlapping community detection algorithms, quality measures, and benchmarks. A thorough comparison of different algorithms (a total of fourteen) is provided. In addition to community level evaluation, we propose a framework for evaluating algorithms ’ abil ..."
Abstract

Cited by 64 (6 self)
 Add to MetaCart
(Show Context)
This paper reviews the state of the art in overlapping community detection algorithms, quality measures, and benchmarks. A thorough comparison of different algorithms (a total of fourteen) is provided. In addition to community level evaluation, we propose a framework for evaluating algorithms ’ ability to detect overlapping nodes, which helps to assess overdetection and underdetection. After considering community level detection performance measured by Normalized Mutual Information, the Omega index, and node level detection performance measured by Fscore, we reached the following conclusions. For low overlapping density networks, SLPA, OSLOM, Game and COPRA offer better performance than the other tested algorithms. For networks with high overlapping density and high overlapping diversity, both SLPA and Game provide relatively stable performance. However, test results also suggest that the detection in such networks is still not yet fully resolved. A common feature observed by various algorithms in realworld networks is the relatively small fraction of overlapping nodes (typically less than 30%), each of which belongs to only 2 or 3 communities.
Timevarying graphs and dynamic networks
 International Journal of Parallel, Emergent and Distributed Systems
"... The past few years have seen intensive research efforts carried out in some apparently unrelated areas of dynamic systems – delaytolerant networks, opportunisticmobility networks, social networks – obtaining closely related insights. Indeed, the concepts discovered in these investigations can be v ..."
Abstract

Cited by 61 (20 self)
 Add to MetaCart
(Show Context)
The past few years have seen intensive research efforts carried out in some apparently unrelated areas of dynamic systems – delaytolerant networks, opportunisticmobility networks, social networks – obtaining closely related insights. Indeed, the concepts discovered in these investigations can be viewed as parts of the same conceptual universe; and the formal models proposed so far to express some specific concepts are components of a larger formal description of this universe. The main contribution of this paper is to integrate the vast collection of concepts, formalisms, and results found in the literature into a unified framework, which we call TVG (for timevarying graphs). Using this framework, it is possible to express directly in the same formalism not only the concepts common to all those different areas, but also those specific to each. Based on this definitional work, employing both existing results and original observations, we present a hierarchical classification of TVGs; each class corresponds to a significant property examined in the distributed computing literature. We then examine how TVGs can be used to study the evolution of network properties, and propose different techniques, depending on whether the indicators for these properties are atemporal (as in the majority of existing studies) or temporal. Finally, we briefly discuss the introduction of randomness in TVGs.