Results 1-10 of 106
Matrix Approximation and Projective Clustering via Volume Sampling
, 2006
Cited by 92 (3 self)
Frieze, Kannan, and Vempala (JACM 2004) proved that a small sample of rows of a given matrix A spans the rows of a low-rank approximation D that minimizes $\|A - D\|_F$ within a small additive error, and the sampling can be done efficiently using just two passes over the matrix. In this paper, we generalize this result in two ways. First, we prove that the additive error drops exponentially by iterating the sampling in an adaptive manner (adaptive sampling). Using this result, we give a pass-efficient algorithm for computing a low-rank approximation with reduced additive error. Our second result is that there exist k rows of A whose span contains the rows of a multiplicative (k + 1)-approximation to the best rank-k matrix; moreover, this subset can be found by sampling k-subsets of rows from a natural distribution (volume sampling). Combining volume sampling with adaptive sampling yields the existence of a set of k + k(k + 1)/ε rows whose span contains the rows of a multiplicative (1 + ε)-approximation. This leads to a PTAS for the following NP-hard
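As a rough illustration of the adaptive sampling idea in this abstract, the sketch below repeatedly samples rows with probability proportional to the squared norm of their residual, that is, the part not yet captured by the span of the rows already chosen. It is a minimal in-memory sketch, not the paper's pass-efficient algorithm; the function names, the tolerance `1e-12`, and the one-row-per-round default are assumptions made for illustration.

```python
import random

def residual(row, basis):
    """Subtract from `row` its projection onto the orthonormal `basis`."""
    r = list(row)
    for b in basis:
        coeff = sum(x * y for x, y in zip(r, b))
        r = [x - coeff * y for x, y in zip(r, b)]
    return r

def sq_norm(v):
    return sum(x * x for x in v)

def error(A, basis):
    """Squared Frobenius error of projecting the rows of A onto span(basis)."""
    return sum(sq_norm(residual(row, basis)) for row in A)

def adaptive_sample(A, rounds, per_round, rng):
    """Pick rows in rounds; each round samples proportionally to residual norms."""
    basis, chosen = [], []
    for _ in range(rounds):
        weights = [sq_norm(residual(row, basis)) for row in A]
        if sum(weights) <= 1e-12:      # rows already (numerically) spanned
            break
        for i in rng.choices(range(len(A)), weights=weights, k=per_round):
            r = residual(A[i], basis)
            n2 = sq_norm(r)
            if n2 > 1e-12:             # skip rows that add no new direction
                basis.append([x / n2 ** 0.5 for x in r])
                chosen.append(i)
    return chosen, basis

# Hypothetical rank-2 example matrix.
A = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [2, -1, 0], [3, 3, 0]]
chosen, basis = adaptive_sample(A, rounds=5, per_round=1, rng=random.Random(0))
print(chosen, error(A, basis))
```

With `rounds=1` this reduces to a single round of squared-length sampling in the style of the Frieze, Kannan, and Vempala result; later rounds concentrate on whatever the earlier rows failed to capture.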
A near-optimal algorithm for computing the entropy of a stream
 In ACM-SIAM Symposium on Discrete Algorithms
, 2007
Cited by 74 (20 self)
We describe a simple algorithm for approximating the empirical entropy of a stream of m values in a single pass, using $O(\varepsilon^{-2} \log(\delta^{-1}) \log m)$ words of space. Our algorithm is based upon a novel extension of a method introduced by Alon, Matias, and Szegedy [1]. We show a space lower bound of $\Omega(\varepsilon^{-2} / \log(\varepsilon^{-1}))$, meaning that our algorithm is near-optimal in terms of its dependency on ε. This improves over previous work on this problem [8, 13, 17, 5]. We show that generalizing to kth-order entropy requires close to linear space for all k ≥ 1, and give additive approximations using our algorithm. Lastly, we show how to compute a multiplicative approximation to the entropy of a random walk on an undirected graph.
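To make the Alon-Matias-Szegedy style of estimator concrete, here is a minimal sketch of the basic unbiased estimator for empirical entropy: pick a position j, count the occurrences r of that item from j onward, and return g(r) - g(r - 1) with g(r) = r log2(m / r). This shows only the classical building block, not the paper's extension, and it stores the stream in a list for clarity (a true one-pass version would pick j by reservoir sampling). All function names are assumptions.

```python
import math
import random
from collections import Counter

def empirical_entropy(stream):
    """Exact empirical entropy in bits, for comparison."""
    m = len(stream)
    return sum((c / m) * math.log2(m / c) for c in Counter(stream).values())

def g(r, m):
    """r * log2(m / r), with the convention g(0) = 0."""
    return 0.0 if r == 0 else r * math.log2(m / r)

def basic_estimate(stream, j):
    """Basic estimator anchored at position j of the stream."""
    m = len(stream)
    r = sum(1 for x in stream[j:] if x == stream[j])  # occurrences from j onward
    return g(r, m) - g(r - 1, m)

def averaged_estimate(stream, samples, rng):
    """Average several independent basic estimators to reduce variance."""
    m = len(stream)
    return sum(basic_estimate(stream, rng.randrange(m)) for _ in range(samples)) / samples

stream = list("abracadabra")
print(empirical_entropy(stream), averaged_estimate(stream, 200, random.Random(1)))
```

Averaging the basic estimator over every position j reproduces the exact entropy, because the differences g(r) - g(r - 1) telescope within each item's occurrences; this is why the estimator is unbiased when j is chosen uniformly at random.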
Efficient semi-streaming algorithms for local triangle counting in massive graphs
 in KDD’08, 2008
Cited by 70 (4 self)
In this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph G = (V, E) we want to estimate as accurately as possible the number of triangles incident to every node v ∈ V in the graph. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first paper that addresses the problem of local triangle counting with a focus on the efficiency issues arising in massive graphs. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help to detect the presence of spamming activity in large-scale Web graphs, as well as to provide useful features to assess content quality in social networks. For computing the local number of triangles we propose two approximation algorithms, which are based on the idea of min-wise independent permutations (Broder et al. 1998). Our algorithms operate in a semi-streaming fashion, using O(|V|) space in main memory and performing O(log |V|) sequential scans over the edges of the graph. The first algorithm we describe in this paper also uses O(|E|) space in external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results in massive graphs demonstrating the practical efficiency of our approach. Luca Becchetti was partially supported by EU Integrated
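The role of min-wise permutations here can be sketched as follows: the number of triangles through an edge (u, v) equals |N(u) ∩ N(v)|, and that intersection size can be recovered from the Jaccard similarity J of the two neighbor sets as J / (1 + J) · (|N(u)| + |N(v)|). The code below estimates J with explicit random permutations and sums over a node's neighbors. It is an in-memory illustration under assumed names, not the paper's semi-streaming algorithms.

```python
import random

def make_perms(universe, k, rng):
    """k random permutations of the universe, stored as element -> rank maps."""
    perms = []
    for _ in range(k):
        order = list(universe)
        rng.shuffle(order)
        perms.append({x: rank for rank, x in enumerate(order)})
    return perms

def minhash_signature(items, perms):
    """Minimum rank of the set under each permutation."""
    return [min(p[x] for x in items) for p in perms]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of permutations on which the two minima agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def intersection_from_jaccard(j, size_a, size_b):
    """|A ∩ B| recovered from J = |A ∩ B| / |A ∪ B| and the two set sizes."""
    return j / (1 + j) * (size_a + size_b)

def local_triangles(adj, u, perms):
    """Estimated number of triangles incident to node u."""
    sig_u = minhash_signature(adj[u], perms)
    total = 0.0
    for v in adj[u]:
        j = estimate_jaccard(sig_u, minhash_signature(adj[v], perms))
        total += intersection_from_jaccard(j, len(adj[u]), len(adj[v]))
    return total / 2  # each triangle at u is seen through two of its edges

# Hypothetical example: the complete graph on 4 nodes has 3 triangles per node.
adj = {u: {v for v in range(4) if v != u} for u in range(4)}
perms = make_perms(range(4), 2000, random.Random(7))
print(local_triangles(adj, 0, perms))
```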
Graph distances in the streaming model: the value of space
 In ACM-SIAM Symposium on Discrete Algorithms
, 2005
Cited by 69 (11 self)
We investigate the importance of space when solving problems based on graph distance in the streaming model. In this model, the input graph is presented as a stream of edges in an arbitrary order. The main computational restriction of the model is that we have limited space and therefore cannot store all the streamed data; we are forced to make space-efficient summaries of the data as we go along. For a graph of n vertices and m edges, we show that testing many graph properties, including connectivity (ergo any reasonable decision problem about distances) and bipartiteness, requires Ω(n) bits of space. Given this, we then investigate how the power of the model increases as we relax our space restriction. Our main result is an efficient randomized algorithm that constructs a (2t + 1)-spanner in one pass. With high probability, it uses $O(t \cdot n^{1+1/t} \log^2 n)$ bits of space and processes each edge in the stream in $O(t^2 \cdot n^{1/t} \log n)$ time. We find approximations to diameter and girth via the constructed spanner. For $t = \Omega(\log n / \log \log n)$, the space requirement of the algorithm is $O(n \cdot \mathrm{polylog}\, n)$, and the per-edge processing time is $O(\mathrm{polylog}\, n)$. We also show a corresponding lower bound of t for the approximation ratio achievable when the space restriction is $O(t \cdot n^{1+1/t} \log^2 n)$. We then consider the scenario in which we are allowed multiple passes over the input stream. Here, we investigate whether allowing these extra passes will compensate for a given space restriction. We show that ... ∗This work was supported by the DoD University Research Initiative (URI) administered by the Office of Naval Research
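To show what a one-pass (2t + 1)-spanner construction has to guarantee, here is a minimal sketch of the simple greedy rule: keep a streamed edge (u, v) only if the edges kept so far do not already connect u and v within distance 2t + 1. This deterministic greedy rule is swapped in for illustration; it is not the randomized algorithm of the paper and makes no claim to its per-edge time bound. Function names are assumptions.

```python
from collections import deque

def bounded_distance(adj, src, dst, limit):
    """BFS distance from src to dst in adj, or None if it exceeds limit."""
    if src == dst:
        return 0
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == limit:
            continue  # do not expand beyond the distance limit
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

def greedy_spanner(edge_stream, t):
    """Process edges once; keep an edge only if it shortens a long detour."""
    stretch = 2 * t + 1
    adj, kept = {}, []
    for u, v in edge_stream:
        if bounded_distance(adj, u, v, stretch) is None:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
            kept.append((u, v))
    return kept, adj

# Hypothetical example: stream the edges of the complete graph on 5 nodes.
edges = [(i, j) for i in range(5) for j in range(i + 1, 5)]
kept, adj = greedy_spanner(edges, t=1)
print(kept)
```

Every discarded edge already has a path of length at most 2t + 1 among the kept edges, so distances in the kept subgraph exceed the original distances by a factor of at most 2t + 1.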
Link-Based Characterization and Detection of Web Spam
 In AIRWeb
, 2006
Cited by 55 (8 self)
We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. Using this approach we are able to detect 80.4% of the Web spam in our sample, with only 1.1% of false positives.
New streaming algorithms for counting triangles in graphs
 In COCOON
, 2005
Cited by 53 (0 self)
We present three streaming algorithms that (ε, δ)-approximate the number of triangles in graphs. As with the previous algorithms [3], the space usage of the presented algorithms is inversely proportional to the number of triangles, while for some families of graphs the space usage is improved. We also prove a lower bound, based on the number of triangles, which indicates that our first algorithm behaves almost optimally on graphs with constant degrees.
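For intuition about why space can shrink as the number of triangles grows, the sketch below implements a standard edge-plus-vertex sampling estimator: sample a uniform edge (u, v) and a uniform third vertex w, and check whether both (u, w) and (v, w) are present. A sample succeeds with probability 3T / (m(n - 2)), so fewer samples are needed when the triangle count T is large. This is a generic in-memory estimator chosen for illustration and is not claimed to be any of the three algorithms in the paper; all names are assumptions.

```python
import random
from itertools import combinations

def exact_triangles(vertices, edges):
    """Brute-force triangle count, for comparison on small graphs."""
    es = {frozenset(e) for e in edges}
    return sum(
        1
        for a, b, c in combinations(vertices, 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= es
    )

def one_sample(vertices, edges, edge_set, rng):
    """One unbiased sample: scaled indicator that (edge, vertex) closes a triangle."""
    u, v = rng.choice(edges)
    w = rng.choice([x for x in vertices if x not in (u, v)])
    hit = frozenset((u, w)) in edge_set and frozenset((v, w)) in edge_set
    return len(edges) * (len(vertices) - 2) / 3 if hit else 0.0

def estimate_triangles(vertices, edges, samples, rng):
    """Average of independent samples; assumes no duplicate edges, n >= 3."""
    edge_set = {frozenset(e) for e in edges}
    return sum(one_sample(vertices, edges, edge_set, rng) for _ in range(samples)) / samples

# Hypothetical example: a triangle with one pendant edge.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(exact_triangles(vertices, edges),
      estimate_triangles(vertices, edges, 5000, random.Random(3)))
```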
Space efficient mining of multigraph streams
 In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS)
Cited by 48 (8 self)
The challenge of monitoring massive amounts of data generated by communication networks has led to interest in data stream processing. We study streams of edges in massive communication multigraphs, defined by (source, destination) pairs. The goal is to compute properties of the underlying graph while using small space (much smaller than the number of communicants), and to avoid bias introduced because some edges may appear many times, while others are seen only once. We give results for three fundamental problems on multigraph degree sequences: estimating frequency moments of degrees, finding the heavy hitter degrees, and computing range sums of degree values. In all cases we are able to show space bounds for our summarizing algorithms that are significantly smaller than storing complete information. We use a variety of data stream methods: sketches, sampling, hashing and distinct counting, but a common feature is that we use cascaded summaries: nesting multiple estimation techniques within one another. In our experimental study, we see that such summaries are highly effective, enabling massive multigraph streams to be effectively summarized to answer queries of interest with high accuracy using only a small amount of space.
Graph sketches: sparsification, spanners, and subgraphs
 In PODS
, 2012
Cited by 46 (9 self)
When processing massive data sets, a core task is to construct synopses of the data. To be useful, a synopsis data structure should be easy to construct while also yielding good approximations of the relevant properties of the data set. A particularly useful class of synopses are sketches, i.e., those based on linear projections of the data. These are applicable in many models including various parallel, stream, and compressed sensing settings. A rich body of analytic and empirical work exists for sketching numerical data such as the frequencies of a set of entities. Our work investigates graph sketching where the graphs of interest encode the relationships between these entities. The main challenge is to capture this richer structure and build the necessary synopses with only linear measurements. In this paper we consider properties of graphs including the size of the cuts, the distances between nodes, and the prevalence of
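One way to see why linear measurements suit cut-type questions is the signed node-incidence encoding sketched below: each node gets a vector indexed by node pairs (i, j) with i < j, holding +1 or -1 on its incident edges. Adding the vectors of a node set S cancels every edge inside S and leaves exactly the edges crossing the cut, and because the operation is a sum, any linear sketch of the vectors can be added in the same way. The encoding and all names here are assumptions for illustration; the truncated abstract does not state the construction the paper uses.

```python
from itertools import combinations

def incidence_vector(node, nodes, edges):
    """Sparse vector over pairs (i, j), i < j: +1 if node == i, -1 if node == j."""
    edge_set = {tuple(sorted(e)) for e in edges}
    vec = {}
    for i, j in combinations(sorted(nodes), 2):
        if (i, j) in edge_set and node in (i, j):
            vec[(i, j)] = 1 if node == i else -1
    return vec

def add_vectors(vectors):
    """Coordinate-wise sum of sparse vectors, dropping zero entries."""
    total = {}
    for vec in vectors:
        for key, val in vec.items():
            total[key] = total.get(key, 0) + val
    return {k: v for k, v in total.items() if v != 0}

def cut_edges(subset, nodes, edges):
    """Edges with exactly one endpoint in `subset`, read off the summed vector."""
    return set(add_vectors(incidence_vector(u, nodes, edges) for u in subset))

# Hypothetical example graph.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
print(cut_edges({0, 1}, nodes, edges))
```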
Estimating statistical aggregates on probabilistic data streams
 ACM Trans. Database Syst
, 2008
Cited by 40 (5 self)
The probabilistic stream model was introduced by Jayram, Kale, and Vee [2007]. It is a generalization of the data stream model that is suited to handling “probabilistic” data, where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilistic stream determines a distribution over a potentially exponential number of classical “deterministic” streams where each item is deterministically one of the domain values. We present algorithms for computing commonly used aggregates on a probabilistic stream. We present the first one-pass streaming algorithms for estimating the expected mean of a probabilistic stream. Next, we consider the problem of estimating frequency moments for probabilistic data. We propose a general approach to obtain unbiased estimators working over probabilistic data by utilizing unbiased estimators designed for standard streams. Applying this approach, we extend a classical data stream algorithm to obtain a one-pass algorithm for estimating F2, the second frequency moment. We present the first known streaming algorithms for estimating F0, the number of distinct items on probabilistic streams. Our work also gives an efficient one-pass algorithm for estimating the median and a two-pass algorithm for estimating the range.
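As background for the kind of unbiased estimator designed for standard streams that this abstract builds on, the sketch below shows a tug-of-war style F2 estimator for an ordinary deterministic stream: keep one counter Z that adds a random ±1 sign per item, and use Z² as the estimate. Which classical algorithm the paper extends is not stated in the abstract, so treating it as this estimator is an assumption, and the extension to probabilistic streams is not shown. The code checks unbiasedness by averaging over every sign assignment; names are illustrative.

```python
from collections import Counter
from itertools import product

def f2(stream):
    """Exact second frequency moment: sum of squared item frequencies."""
    return sum(c * c for c in Counter(stream).values())

def tug_of_war(stream, sign):
    """Single-counter estimate Z**2, where Z adds sign[item] for each item."""
    z = 0
    for item in stream:
        z += sign[item]
    return z * z

def expected_over_all_signs(stream):
    """Average of the estimator over every +1/-1 assignment to the items."""
    items = sorted(set(stream))
    values = [
        tug_of_war(stream, dict(zip(items, signs)))
        for signs in product([-1, 1], repeat=len(items))
    ]
    return sum(values) / len(values)

stream = ["a", "b", "a", "c", "b", "a"]
print(f2(stream), expected_over_all_signs(stream))
```

The cross terms between distinct items cancel in expectation, which is what makes Z² unbiased; practical implementations draw the signs from a 4-wise independent hash family and average many counters.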
Trading off space for passes in graph streaming problems
 In ACM-SIAM SODA, 714–723
, 2006
Cited by 36 (4 self)
Data stream processing has recently received increasing attention as a computational paradigm for dealing with massive data sets. Surprisingly, no algorithm with both sublinear space and passes is known for natural graph problems in classical read-only streaming. Motivated by technological factors of modern storage systems, some authors have recently started to investigate the computational power of less restrictive models where writing streams is allowed. In this paper, we show that the use of intermediate temporary streams is powerful enough to provide effective space-passes tradeoffs for natural graph problems. In particular, for any space restriction of s bits, we show that single-source shortest paths in directed graphs with small positive integer edge weights can be solved in $O((n \log^{3/2} n) / \sqrt{s})$ passes. The result can be generalized to deal with multiple sources within the same bounds. This is the first known streaming algorithm for shortest paths in directed graphs. For undirected connectivity, we devise an $O((n \log n)/s)$-pass algorithm. Both problems require Ω(n/s) passes under the restrictions we consider. We also show that the model where intermediate temporary streams are allowed can be strictly more powerful than classical streaming for some problems, while maintaining all of its hardness for others.