## Web graph similarity for anomaly detection (2010)

Venue: Journal of Internet Services and Applications

Citations: 29 - 4 self

### Citations

990 | Bigtable: A Distributed Storage System for Structured Data
- Chang, Dean, et al.

574 | Similarity Flooding: A Versatile Graph Matching Algorithm
- Melnik, Molina, et al.
- 2002
Citation Context ...built upon the notion of vertex similarity, which in turn is based on the rule that “two vertices are similar if their neighbors are similar”. This rule appears to have been independently proposed by [2, 16, 17]. These references proposed different ways of computing similarity scores for pairs of vertices. For some of these references, it is also possible to compute a similarity score for graphs once a simil...

516 | Syntactic clustering of the Web
- Broder, Glassman, et al.
- 1997
Citation Context ...overage requirements, it is effective in detecting some types of anomalies that we discuss in Section 6. Two main ways of computing this kind of similarity are graph edit distance and the Jaccard index [5] (as we apply to graphs). Graph edit distance [6] counts the operations on vertices and edges needed to transform one graph into the other. The operations consist of insertions, deletions, and i...

430 | Similarity estimation techniques from rounding algorithms
- Charikar
- 2002
Citation Context ...ty feature space (signatures) for comparison. There are many ways to compare sets of features, but here we focus on a scheme called SimHash, originally developed for high-dimensional vector comparison [8] and later applied to document comparison [15]. Again, our challenge is in converting our graphs into appropriate sets of features to be input into the SimHash algorithm. We start by reviewing the SimHa...

370 | Simrank: a measure of structural-context similarity
- Jeh, Widom
- 2002
Citation Context ...built upon the notion of vertex similarity, which in turn is based on the rule that “two vertices are similar if their neighbors are similar”. This rule appears to have been independently proposed by [2, 16, 17]. These references proposed different ways of computing similarity scores for pairs of vertices. For some of these references, it is also possible to compute a similarity score for graphs once a simil...

279 | The evolution of the Web and implications for an incremental crawler
- Cho, Garcia-Molina
- 2000
Citation Context ...the presented similarity schemes could quantify the changes of the web graphs over time. These changes are imposed by the natural evolution of the web and grow as time passes. This fact is studied in [9, 18] and is also confirmed by Fig. 3(c). In particular, if we have a sequence of web graphs generated from crawls that were obtained sequentially over time, we expect the similarity to decrease over time....

265 | Comparing top k lists
- Fagin, Kumar, et al.
- 2003
Citation Context ...straints: (1) we want the rank correlation result to be sensitive to quality, and (2) we want to compute rank correlation for two lists that are not permutations of each other. As such, inspired by [12], we revise the formula in Eq. 2 to a similarity measure as $\mathrm{sim}_{VR}(G, G') = 1 - \frac{2 \sum_{v \in V \cup V'} w_v \times (\pi_v - \pi'_v)^2}{D}$ (3), where $\pi_v$ and $\pi'_v$ are the ranks of $v$ in the sorted list for $G$ and $G'$, respectively,...

234 | Nonlinear Time Series: Nonparametric and Parametric Methods
- Fan, Yao
- 2003
Citation Context ...ermine similarity thresholds automatically and in a statistically sound way, we use in our production implementation both non-parametric and parametric methods from time series forecasting, e.g., see [13]. However, the discussion of these methods is beyond the scope of this paper. Here, we use a simple method with a fixed threshold t for each algorithm that works very well in practice. To understand h...

216 | What’s new on the Web?: The evolution of the web from a search engine perspective
- Ntoulas, Cho, et al.
- 2004
Citation Context ...the presented similarity schemes could quantify the changes of the web graphs over time. These changes are imposed by the natural evolution of the web and grow as time passes. This fact is studied in [9, 18] and is also confirmed by Fig. 3(c). In particular, if we have a sequence of web graphs generated from crawls that were obtained sequentially over time, we expect the similarity to decrease over time....

112 | Finding near-duplicate Web pages: A large-scale evaluation of algorithms
- Henzinger
- 2006
Citation Context ... family are used to compare objects that are naturally sequenced, e.g., documents that consist of a sequence of words. For example, shingling [5] is frequently used to detect near-duplicate web pages [15]. Because sequence comparison algorithms are efficient and can operate on large inputs, we want to consider them as candidates for our problem. Thus, here we use a sequence comparison scheme, shinglin...

111 | Ranking the web frontier
- Eiron, Curley, et al.
- 2004
Citation Context ...These properties can be numerical (scalars or distributions) or categorical (labels or lists). Some vertex properties that we will focus on are PageRank as a “quality” score (computed in a host graph [11]), the list of hosts pointing to it (called its inlinks), and the list of hosts it is pointing to (called its outlinks). 3 Potential Anomalies Since our goal is anomaly detection, we now give examples...

108 | A measure of similarity between graph vertices: Applications to synonym extraction and web searching
- Blondel, Gajardo, et al.
- 2004
Citation Context ...built upon the notion of vertex similarity, which in turn is based on the rule that “two vertices are similar if their neighbors are similar”. This rule appears to have been independently proposed by [2, 16, 17]. These references proposed different ways of computing similarity scores for pairs of vertices. For some of these references, it is also possible to compute a similarity score for graphs once a simil...

105 | Discovering large dense subgraphs in massive graphs
- Gibson, Kumar, et al.
- 2005
Citation Context ...rting the graphs into linear sequences that can then be compared using shingling. As far as we know, our proposal is the first application of shingling to graph similarity. However, a related work is [14], where shingling was applied to the detection of large dense subgraphs. We start the description of our Sequence Similarity Algorithm by reviewing the base shingling scheme of [5]. We then provide a ...

90 | A Survey of Web Metrics
- Dhyani, Ng, et al.
- 2002
Citation Context ...this approach is proposed by [19], which computes the similarity of vertex degree distributions. We believe this variation can readily be extended to other graph properties such as those discussed in [10]. Similarly, if one has edge properties, it is possible to construct edge vectors and compare them. For our particular similarity measure, we compare edges, giving each edge a weight that captures the...

82 | MIL Primitives for Querying a Fragmented World
- Boncz, Kersten
- 1999
Citation Context ... bug-free. For example, modern search engines use column-orientation in flat fragmented files to store graph data, since this approach has many advantages for storing and processing search engine data [3, 7]. However, this approach requires the development of custom code to access the graph data, e.g., to fetch vertex names, to join vertices with edges, etc. If there is a bug in the code that joins vert...

59 | Link analysis ranking: algorithms, theory, and experiments
- Borodin, Roberts, et al.
- 2005
Citation Context ... Spearman’s rho (denoted ρ). Although rank correlation is well known in the information retrieval field, its application to graph similarity appears to be new. A closely related application, proposed in [4], is to the similarity of ranking algorithms. The particular vertex ranking algorithm we use proceeds as follows. Let G = (V, E) and G′ = (V′, E′) be the two graphs that we want to compare. For each g...

39 | Graph edit distance from spectral seriation
- Robles-Kelly, Hancock
- 2005
Citation Context ...hs, respectively, where Q[i] is the quality q(v_i) of vertex v_i. Then we compare the two vectors by computing the average difference between all Q[i], Q′[i]. For some works that use this approach, see [6, 20, 24]. A slight variation on this approach is proposed by [19], which computes the similarity of vertex degree distributions. We believe this variation can readily be extended to other graph properties suc...

37 | Exploiting the hierarchical structure for link analysis
- Xue, Yang, et al.
- 2005
Citation Context ...sing. In our paper we focus on host-level web graphs (called host graphs), since they are extensively used in the search industry. For some advantages of host-level graphs in link analysis of the Web, see [22]. A (host-level) web graph is a directed, weighted graph whose vertices correspond to active hosts of the web, and whose weighted edges aggregate the hyperlinks of web pages in these hosts. We represe...

26 | Structure-Based Similarity Search with Graph Histograms
- Papadopoulos, Manolopoulos
- 1999
Citation Context ...Then we compare the two vectors by computing the average difference between all Q[i], Q′[i]. For some works that use this approach, see [6, 20, 24]. A slight variation on this approach is proposed by [19], which computes the similarity of vertex degree distributions. We believe this variation can readily be extended to other graph properties such as those discussed in [10]. Similarly, if one has edge ...

19 | The distribution of PageRank follows a power-law only for particular values of the damping factor
- Becchetti, Castillo
- 2006
Citation Context ...e detected by comparing the similarity score of two graphs against some threshold, or by looking for unusual patterns in the time series. 4.2 Similarity Requirements A similarity function sim(G, G′) ∈ [0, 1] has value 1 if G and G′ are identical, and value 0 if G and G′ share no common features. The similarity function needs to satisfy the following requirements to be useful for our domain: 1. Scalabilit...

17 | A Graph-Theoretic Approach to Enterprise Network Dynamics. Birkhäuser
- Bunke, Dickinson, et al.
- 2007
Citation Context ...The problem of comparing graphs or computing their similarity has been an important problem with applications in many areas, from biological networks to web searching. For an overview, see the book [6] and two web documents [21, 23], or search on the internet using the query “graph similarity”, which returns many useful links in major search engines. Naturally, the diversity of the areas has create...

12 | A study of graph spectra for comparing graphs
- Zhu, Wilson
Citation Context ...hs, respectively, where Q[i] is the quality q(v_i) of vertex v_i. Then we compare the two vectors by computing the average difference between all Q[i], Q′[i]. For some works that use this approach, see [6, 20, 24]. A slight variation on this approach is proposed by [19], which computes the similarity of vertex degree distributions. We believe this variation can readily be extended to other graph properties suc...

2 | Graph similarity
- Zager, Verghese
- 2005
Citation Context ...raphs or computing their similarity has been an important problem with applications in many areas, from biological networks to web searching. For an overview, see the book [6] and two web documents [21, 23], or search on the internet using the query “graph similarity”, which returns many useful links in major search engines. Naturally, the diversity of the areas has created different approaches to graph...

1 | References for graph similarity
- Seidl
- 2007
Citation Context ...raphs or computing their similarity has been an important problem with applications in many areas, from biological networks to web searching. For an overview, see the book [6] and two web documents [21, 23], or search on the internet using the query “graph similarity”, which returns many useful links in major search engines. Naturally, the diversity of the areas has created different approaches to graph...