Graph Sample and Hold: A Framework for Big-Graph Analytics
Cited by 5 (1 self)
Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate the graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g., web graphs, social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes to estimate certain graph properties (e.g., triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics,
Role Discovery in Networks
, 2014
Cited by 4 (2 self)
Roles represent node-level connectivity patterns such as star-center nodes, star-edge nodes, near-cliques, or nodes that act as bridges to different regions of the graph. Intuitively, two nodes belong to the same role if they are structurally similar. Roles have been mainly of interest to sociologists, but more recently, roles have become increasingly useful in other domains. Traditionally, the notion of roles was defined based on graph equivalences such as structural, regular, and stochastic equivalences. We briefly revisit these notions and instead propose a more general formulation of roles based on the similarity of a feature representation (in contrast to the graph representation). This leads us to propose a taxonomy of two general classes of techniques for discovering roles, which includes (i) graph-based roles and (ii) feature-based roles. This survey focuses primarily on feature-based roles. In particular, we also introduce a flexible framework for discovering roles using the notion of structural similarity on a feature-based representation. The framework consists of two fundamental components: (1) role feature construction and (2) role assignment using the learned feature representation. We discuss the relevant decisions for discovering feature-based roles and highlight the advantages and disadvantages of the many techniques that can be used for this purpose. Finally, we discuss potential applications and future directions and challenges.
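The two components named in this abstract, (1) role feature construction and (2) role assignment, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the survey: the features (degree and local clustering coefficient), the nearest-centroid assignment, and the hand-picked centroids, which a real pipeline would presumably learn from the data.

```python
from collections import defaultdict


def node_features(edges):
    """Step 1 (role feature construction): map each node to
    (degree, local clustering coefficient). Feature choice is illustrative."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    feats = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        # Count edges that connect two neighbours of this node.
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        clustering = 2.0 * links / (k * (k - 1)) if k > 1 else 0.0
        feats[node] = (float(k), clustering)
    return feats


def assign_roles(feats, centroids):
    """Step 2 (role assignment): give each node the index of the nearest
    centroid in feature space (squared Euclidean distance)."""
    roles = {}
    for node, f in feats.items():
        dists = [sum((x - c) ** 2 for x, c in zip(f, cen)) for cen in centroids]
        roles[node] = dists.index(min(dists))
    return roles


# Usage: a star graph with centre 0 and leaves 1..4.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
features = node_features(star)
# Hypothetical centroids: role 0 ~ "star-center", role 1 ~ "star-edge".
roles = assign_roles(features, [(4.0, 0.0), (1.0, 0.0)])
```

In this toy example the centre lands in role 0 and every leaf in role 1, which matches the intuition that structurally similar nodes share a role.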
Active learning for streaming networked data
 in CIKM, 2014
Cited by 1 (0 self)
Mining high-speed data streams has become an important topic due to the rapid growth of online data. In this paper, we study the problem of active learning for streaming networked data. The goal is to train an accurate model for classifying networked data that arrives in a streaming manner by querying as few labels as possible. The problem is extremely challenging, as both the data distribution and the network structure may change over time. The query decision has to be made for each data instance sequentially, by considering the dynamic network structure. We propose a novel streaming active query strategy based on structural variability. We prove that by querying labels we can monotonically decrease the structural variability and better adapt to concept drift. To speed up the learning process, we present a network sampling algorithm to sample instances from the data stream, which provides a way for us to handle large volumes of streaming data. We evaluate the proposed approach on four datasets of different genres: Weibo, Slashdot, IMDB, and ArnetMiner. Experimental results show that our model performs much better (+5-10% by F1-score on average) than several alternative methods for active learning over streaming networked data.
Sampling Representative Users from Large Social Networks
Finding a subset of users to statistically represent the original social network is a fundamental issue in Social Network Analysis (SNA). The problem has not been extensively studied in existing literature. In this paper, we present a formal definition of the problem of sampling representative users from a social network. We propose two sampling models and theoretically prove their NP-hardness. To efficiently solve the two models, we present an efficient algorithm with provable approximation guarantees. Experimental results on two datasets show that the proposed models for sampling representative users significantly outperform (+6%-23% in terms of
ACM SIGKDD Workshop on Interactive Data Exploration and Analytics
"... Proceedings of the ACM SIGKDD Workshop on ..."
Assortativity in Chung-Lu Random Graph Models
, 2014
Due to the widespread interest in networks as a representation to investigate the properties of complex systems, there has been a great deal of interest in generative models of graph structure that can capture the properties of networks observed in the real world. Recent models have focused primarily on accurate characterization of sparse networks with skewed degree distributions, short path lengths, and local clustering. While assortativity (degree correlation among linked nodes) is used as a measure to both describe and evaluate connectivity patterns in networks, there has been little effort to explicitly incorporate patterns of assortativity into model representations. This is because many graph models are edge-based (modeling whether a link should be
Social network sampling using spanning trees
 in International Journal of Modern Physics C, 2015
Due to the large scale of most online social networks and the limitations on accessing them, it is hard or infeasible to access them directly in a reasonable amount of time for study and analysis. Hence, network sampling has emerged as a suitable technique to study and analyze real networks. The main goal of sampling online social networks is to construct a small-scale sampled network that preserves the most important properties of the original network. In this paper, we propose two sampling algorithms for sampling online social networks using spanning trees. The first proposed sampling algorithm finds several spanning trees from randomly chosen starting nodes; the edges in these spanning trees are then ranked according to the number of times each edge appears in the set of spanning trees found in the given network. The sampled network is then constructed as a subgraph of the original network that contains a fraction of nodes that are incident on highly ranked edges. In order to avoid traversing the entire network, the second sampling algorithm is proposed using partial spanning trees. The second sampling algorithm is similar to the first algorithm except that it uses partial spanning trees. Several experiments are conducted to examine the performance of the proposed sampling algorithms on well-known real networks. The obtained results in comparison with other popular sampling methods demonstrate the efficiency of
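The first algorithm described in this abstract (build several spanning trees from random start nodes, rank edges by how often they appear, keep nodes incident on highly ranked edges) can be sketched roughly as follows. This is only a sketch under stated assumptions: the abstract does not say how the spanning trees are built, so breadth-first search is assumed here, and the function names, the `node_fraction` parameter, and the use of an induced subgraph are all hypothetical choices.

```python
import random
from collections import Counter, deque


def bfs_spanning_tree(adj, start):
    """Edges of a BFS spanning tree of the component containing `start`.
    BFS is an assumption; the abstract does not name a construction."""
    seen = {start}
    tree = []
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                tree.append((min(u, v), max(u, v)))  # canonical edge order
                queue.append(v)
    return tree


def spanning_tree_sample(adj, num_trees, node_fraction, seed=0):
    """Rank edges by frequency across spanning trees, then keep nodes
    incident on the top-ranked edges until the node budget is reached."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    counts = Counter()
    for _ in range(num_trees):
        counts.update(bfs_spanning_tree(adj, rng.choice(nodes)))
    target = max(1, int(node_fraction * len(nodes)))
    chosen = set()
    for (u, v), _ in counts.most_common():
        if len(chosen) >= target:
            break
        chosen.update((u, v))  # may overshoot the budget by one node
    # Sampled network: subgraph of the original induced on the chosen nodes.
    sampled_edges = {(u, v) for u in chosen for v in adj[u]
                     if v in chosen and u < v}
    return chosen, sampled_edges


# Usage: a 6-node cycle given as an adjacency dict.
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
sample_nodes, sample_edges = spanning_tree_sample(cycle, num_trees=5,
                                                  node_fraction=0.5)
```

The second algorithm in the abstract would differ by stopping each tree early (a partial spanning tree); that variant is not shown because the abstract gives no stopping rule.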
Triad-based Role Discovery for Large Social Systems
The social role of a participant in a social system conceptualizes the circumstances under which she chooses to interact with others, making their discovery and analysis important for theoretical and practical purposes. In this paper, we propose a methodology to detect such roles by utilizing the conditional triad censuses of ego-networks. These censuses are a promising tool for social role extraction because they capture the degree to which basic social forces push upon a user to interact with others in a system. Clusters of triad censuses, inferred from network samples that preserve local structural properties, define the social roles. The approach is demonstrated on two large online interaction networks.
Practical Characterization of Large Networks Using Neighborhood Information
On the Propagation of Low-Rate Measurement Error to Subgraph Counts in Large, Sparse Networks
 arXiv preprint
Our work in this paper is motivated by an elementary but also fundamental and highly practical observation: that uncertainty in constructing a network graph Ĝ, as an approximation (or estimate) of some true graph G, manifests as errors in the status of (non-)edges that must necessarily propagate to any summaries η(G) we seek. Mimicking the common practice of using plug-in estimates η(Ĝ) as proxies for η(G), our goal is to characterize the distribution of the discrepancy D = η(Ĝ) − η(G), in the specific case where η(·) is a subgraph count. In the empirically relevant setting of large, sparse graphs with low-rate measurement errors, we demonstrate under an independent and unbiased error model and for the specific case of counting edges that a Poisson-like regime maintains. Specifically, we show that the appropriate limiting distribution is a Skellam distribution, rather than a normal distribution. Next, because dependent errors typically can be expected when counting subgraphs in practice, either at the level of the edges themselves or due to overlap among subgraphs, we develop a parallel formalism for using the Skellam distribution in such cases. In particular, using Stein’s method, we present a series of results leading to the quantification of the accuracy with which the difference of two sums of dependent Bernoulli random variables may be approximated by a Skellam. This formulation is general and likely of some independent interest. We then illustrate the use of these results in our original context of subgraph counts, where we examine (i) the case of counting edges, under a simple dependent error model, and (ii) the case of counting chains of length 2 under an independent error model. We finish with a discussion of various open problems raised by our work.
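For readers unfamiliar with the Skellam distribution mentioned above: it is the distribution of the difference of two independent Poisson random variables. When η counts edges, the discrepancy D equals the number of spuriously observed edges minus the number of missed true edges, which is why a difference of two counts arises. The sketch below computes the Skellam probability mass function by direct convolution; the rate names `mu_add` and `mu_drop` and the truncation parameter `terms` are assumptions made for illustration and do not come from the paper.

```python
import math


def poisson_pmf(k, mu):
    """P(X = k) for X ~ Poisson(mu), mu > 0, computed in log space
    to avoid overflow for large k."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def skellam_pmf(d, mu_add, mu_drop, terms=200):
    """P(A - B = d) for independent A ~ Poisson(mu_add), B ~ Poisson(mu_drop).
    The infinite convolution sum is truncated after `terms` values of B,
    which is adequate for small rates."""
    total = 0.0
    for b in range(max(0, -d), terms):
        total += poisson_pmf(b + d, mu_add) * poisson_pmf(b, mu_drop)
    return total


# Usage with hypothetical rates: on average 2.0 spurious edges are added
# and 1.5 true edges are missed.
pmf = {d: skellam_pmf(d, 2.0, 1.5) for d in range(-40, 41)}
mean = sum(d * p for d, p in pmf.items())  # close to 2.0 - 1.5
```

The mean of a Skellam variable is the difference of the two rates and its variance is their sum, so with equal rates the distribution is symmetric about zero.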