Graph-Based Data Clustering with Overlaps
To appear in Discrete Optimization, 2010
Cited by 6 (3 self)
We introduce overlap cluster graph modification problems where, other than in most previous work, the clusters of the target graph may overlap. More precisely, the studied graph problems ask for a minimum number of edge modifications such that the resulting graph consists of clusters (that is, maximal cliques) that may overlap up to a certain amount specified by the overlap number s. In the case of s-vertex-overlap, each vertex may be part of at most s maximal cliques; s-edge-overlap is analogously defined in terms of edges. We provide a complexity dichotomy (polynomial-time solvable versus NP-hard) for the underlying edge modification problems, develop forbidden subgraph characterizations of “cluster graphs with overlaps”, and study the parameterized complexity in terms of the number of allowed edge modifications, achieving fixed-parameter tractability (in case of constant s-values) and parameterized hardness (in case of unbounded s-values).
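As a rough illustration of the s-vertex-overlap condition described in this abstract, the following sketch enumerates the maximal cliques of a small graph and checks that no vertex lies in more than s of them. The helper names (`build_adj`, `maximal_cliques`, `has_s_vertex_overlap`) and the example graph are assumptions made for this illustration. The enumeration is a plain Bron-Kerbosch recursion, not the edge-modification algorithms of the paper.

```python
def build_adj(edges):
    """Build an undirected adjacency map {vertex: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already processed vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def has_s_vertex_overlap(adj, s):
    """True if every vertex belongs to at most s maximal cliques."""
    cliques = maximal_cliques(adj)
    return all(sum(v in c for c in cliques) <= s for v in adj)


# Hypothetical example: two triangles sharing vertex 2.
adj = build_adj([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)])
print(has_s_vertex_overlap(adj, 1))  # vertex 2 lies in two maximal cliques
print(has_s_vertex_overlap(adj, 2))
```

With s = 1 the check reduces to asking whether the graph is a disjoint union of cliques, that is, an ordinary cluster graph.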
A Large-Scale Study of Link Spam Detection by Graph Algorithms
Cited by 6 (0 self)
Link spam refers to attempts to promote the ranking of spammers' web sites by deceiving link-based ranking algorithms in search engines. Spammers often create a densely connected link structure of sites, a so-called "link farm". In this paper, we study the overall structure and distribution of link farms in a large-scale graph of the Japanese Web with 5.8 million sites and 283 million links. To examine the spam structure, we apply three graph algorithms to the web graph. First, the web graph is decomposed into strongly connected components (SCCs). Besides the largest SCC (the core) in the center of the web, we have observed that most of the large components consist of link farms. Next, to extract spam sites in the core, we enumerate maximal cliques as seeds of link farms. Finally, we expand these link farms as a reliable spam seed set by a minimum cut technique that separates links among spam and non-spam sites. We found about 0.6 million spam sites in SCCs around the core, and extracted an additional 8 thousand and 49 thousand sites as spam with high precision in the core by the maximal clique enumeration and by the minimum cut technique, respectively.
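The first step named in this abstract, decomposition into strongly connected components, can be sketched as follows. This is a minimal illustration using Kosaraju's algorithm on a toy directed graph; the function name and the site labels are hypothetical, and nothing here reflects how the authors implemented the step at web scale.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm. `graph` maps a node to an iterable of successors."""
    nodes = set(graph) | {w for succs in graph.values() for w in succs}
    reverse = {v: [] for v in nodes}
    for v, succs in graph.items():
        for w in succs:
            reverse[w].append(v)

    # Pass 1: iterative DFS, recording nodes in order of finishing time.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph, in decreasing finishing time.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = set(), [root]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


# Hypothetical site graph: a three-site cycle, a two-site cycle, one isolated site.
web = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"], "f": []}
sccs = strongly_connected_components(web)
core = max(sccs, key=len)  # the largest SCC plays the role of the "core"
print(sorted(core))
```

In the setting of the abstract, the remaining large components would then be inspected as link farm candidates, while the core is passed on to the clique and minimum cut steps.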
An efficient algorithm for the extended (l,d)-motif problem with unknown number of binding sites
Proc. BIBE, 2005
Cited by 4 (2 self)
Finding common patterns, or motifs, from a set of DNA sequences is an important problem in molecular biology. Most motif-discovering algorithms and software require the length of the motif as input. Motivated by the fact that the motif's length is usually unknown in practice, Styczynski et al. introduced the Extended (l,d)-Motif Problem (EMP), where the motif's length is not an input parameter. Unfortunately, the algorithm given by Styczynski et al. to solve EMP can take an unacceptably long time to run, e.g. over 3 months to discover a length-14 motif. This paper makes two main contributions. First, we eliminate another input parameter from EMP: the minimum number of binding sites in the DNA sequences. Having fewer input parameters not only reduces the burden on the user, but may also give more realistic and robust results, since restrictions on the length or on the number of binding sites make little sense when the best motif may be neither the longest nor the one with the largest number of binding sites. Second, we develop an efficient algorithm to solve our redefined problem. The algorithm is also a fast solution for EMP (without any sacrifice in accuracy), making EMP practical.
Maximal Biclique Subgraphs and Closed Pattern Pairs of the Adjacency Matrix: A One-to-one Correspondence and Mining Algorithms
2007
Cited by 4 (1 self)
Maximal biclique (also known as complete bipartite) subgraphs can model many applications in web mining, business, and bioinformatics. Enumerating maximal biclique subgraphs from a graph is a computationally challenging problem, as the size of the output can become exponentially large with respect to the vertex number when the graph grows. In this paper, we efficiently enumerate them through the use of closed patterns of the adjacency matrix of the graph. For an undirected graph G without self-loops, we prove that: (i) the number of closed patterns in the adjacency matrix of G is even; (ii) the number of the closed patterns is precisely double the number of maximal biclique subgraphs of G; and (iii) for every maximal biclique subgraph, there always exists a unique pair of closed patterns that matches the two vertex sets of the subgraph. Therefore, the problem of enumerating maximal bicliques can be solved by using efficient algorithms for mining closed patterns, which are algorithms extensively studied in the data mining field. However, this direct use of existing algorithms causes a duplicated enumeration. To achieve high efficiency, we propose an O(mn) time delay algorithm for a non-duplicated enumeration, in particular for enumerating those maximal bicliques with a large size, where m and n are the number of edges and vertices of the graph respectively. We evaluate the high efficiency of our algorithm by comparing it to state-of-the-art algorithms on three categories of graphs: randomly generated graphs, benchmarks, and a real-life protein interaction network. In this paper, we also prove that if self-loops are allowed in a graph, then the number of closed patterns in the adjacency matrix is not necessarily even; but the maximal bicliques are exactly the same as those of the graph after removing all the self-loops.
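The correspondence stated in this abstract can be checked by brute force on very small graphs. The sketch below rests on an assumed reading of "closed pattern": a nonempty vertex set P whose common neighbourhood is nonempty and whose closure (the common neighbourhood of that common neighbourhood) equals P. All function names are hypothetical, and the exhaustive subset search only illustrates the counting claim; it is unrelated to the O(mn) delay algorithm of the paper.

```python
from itertools import combinations


def common_neighbors(adj, vertices):
    """Vertices adjacent to every vertex in `vertices`."""
    result = set(adj)
    for v in vertices:
        result &= adj[v]
    return result


def closed_patterns(adj):
    """Nonempty vertex sets with nonempty support whose closure is the set itself."""
    found = set()
    nodes = sorted(adj)
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            pattern = frozenset(subset)
            support = common_neighbors(adj, pattern)
            if support and common_neighbors(adj, support) == pattern:
                found.add(pattern)
    return found


def maximal_bicliques(adj):
    """Pair each closed pattern with its support; each biclique appears once."""
    return {
        frozenset({p, frozenset(common_neighbors(adj, p))})
        for p in closed_patterns(adj)
    }


# Triangle on vertices 0, 1, 2 (undirected, no self-loops).
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
patterns = closed_patterns(triangle)
bicliques = maximal_bicliques(triangle)
print(len(patterns), len(bicliques))
```

For the triangle, each single vertex and each pair of vertices is a closed pattern, and pairing a vertex with the opposite edge gives the maximal bicliques, so the pattern count is twice the biclique count, as the abstract states.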
Online optimization of 802.11 mesh networks
In Proc. of CoNEXT, 2009
Cited by 3 (0 self)
802.11 wireless mesh networks are ubiquitous, but suffer from severe performance degradations due to poor synergy between the 802.11 CSMA MAC protocol and higher layers. Several solutions have been proposed that either involve significant modifications to the 802.11 MAC or legacy higher layer protocols, or rely on 802.11 MAC models seeded with off-line measurements performed during network downtime. We introduce a technique for online optimization of 802.11 wireless mesh networks using rate control at the network layer. The technique is based on a lightweight model that characterizes the feasible rates region of an operational 802.11 wireless mesh network. Unlike existing 802.11 modeling approaches, the parameters of this model can be estimated online, incur minimal overhead and can be realized using standard probing mechanisms at the network layer. Using analysis and extensive measurements over a wireless mesh network testbed, we validate the assumptions on which the model is built, and explain the principles behind the choice and estimation of its parameters. The benefits of the model and its solution in terms of fairness, throughput and stability are demonstrated operationally for a range of multi-hop topologies and configurations.
Finding Maximal Cliques in Massive Networks by H*-graph
Cited by 3 (3 self)
Maximal clique enumeration (MCE) is a fundamental problem in graph theory and has important applications in many areas such as social network analysis and bioinformatics. The problem is extensively studied; however, the best existing algorithms require memory space linear in the size of the input graph. This has become a serious concern in view of the massive volume of today's fast-growing network graphs. Since MCE requires random access to different parts of a large graph, it is difficult to divide the graph into smaller parts and process one part at a time, because either the result may be incorrect and incomplete, or it incurs huge cost on merging the results from different parts. We propose a novel notion, H*-graph, which defines the core of a network and extends to encompass the neighborhood of the core for MCE computation. We propose the first external-memory algorithm for MCE (ExtMCE) that uses the H*-graph to bound the memory usage. We prove both the correctness and completeness of the result computed by ExtMCE. Extensive experiments verify that ExtMCE efficiently processes large networks that cannot fit in memory. We also show that the H*-graph captures important properties of the network; thus, updating the maximal cliques in the H*-graph retains the most essential information, with a low update cost, when it is infeasible to perform an update on the entire network.
A correspondence between maximal complete bipartite subgraphs and closed patterns
In Proc. of the 9th PKDD Conference, 2005
Cited by 2 (2 self)
For an undirected graph G without self-loops, we prove: (i) that the number of closed patterns in the adjacency matrix of G is even; (ii) that the number of the closed patterns is precisely double the number of maximal complete bipartite subgraphs of G; (iii) that for every maximal complete bipartite subgraph, there always exists a unique and distinct pair of closed patterns that matches the two vertex sets of the subgraph. Therefore, we can efficiently enumerate all maximal complete bipartite subgraphs by using algorithms for mining closed patterns, which have been extensively studied in the data mining field.
Maximal Quasi-Bicliques with Balanced Noise Tolerance: Concepts and Co-clustering Applications
2008
Cited by 2 (0 self)
The rigid all-versus-all adjacency required by a maximal biclique for its two vertex sets is extremely vulnerable to missing data. In the past, several types of quasi-bicliques have been proposed to tackle this problem; however, their noise tolerance is usually unbalanced and can be very skewed. In this paper, we improve the noise tolerance of maximal quasi-bicliques by allowing every vertex to tolerate up to the same number, or the same percentage, of missing edges. This idea leads to a more natural interaction between the two vertex sets: a balanced most-versus-most adjacency. This generalization is also non-trivial, as many large maximal quasi-biclique subgraphs do not contain any maximal bicliques. This observation implies that direct expansion from maximal bicliques may not guarantee a complete enumeration of all maximal quasi-bicliques. We present important properties of maximal quasi-bicliques, such as a bounded closure property and a fixed point property, to design efficient algorithms. Maximal quasi-bicliques are closely related to co-clustering problems such as documents and words co-clustering, images and features co-clustering, stocks and financial ratios co-clustering, etc. Here, we demonstrate the usefulness of our concepts using a new application, a bioinformatics example in which the prediction of true protein interactions is investigated.
On Independent Sets and Bicliques in Graphs
Cited by 2 (1 self)
Bicliques of graphs have been studied extensively, partially motivated by the large number of applications. One of the main algorithmic interests is in designing algorithms to enumerate all maximal bicliques of a (bipartite) graph. Polynomial-time reductions have been used explicitly or implicitly to design polynomial delay algorithms to enumerate all maximal bicliques. Based on polynomial-time Turing reductions, various algorithmic problems on (maximal) bicliques can be studied by considering the related problem for (maximal) independent sets. In this line of research, we improve Prisner's upper bound on the number of maximal bicliques [Combinatorica, 2000] and show that the maximum number of maximal bicliques in a graph on n vertices is exactly 3^(n/3) (up to a polynomial factor). The main results of this paper are O(1.3642^n) time algorithms to compute the number of maximal independent sets and maximal bicliques in a graph.
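The 3^(n/3) figure has a well-known counterpart for maximal independent sets: by the classical Moon-Moser result, a disjoint union of n/3 triangles has exactly 3^(n/3) maximal independent sets, since one vertex must be picked from each triangle. The sketch below verifies this by brute force for a few small cases. The function names are hypothetical, and the exhaustive search is only an illustration of the related quantity for independent sets; it is not the O(1.3642^n) counting algorithm of the paper.

```python
from itertools import combinations


def count_maximal_independent_sets(adj):
    """Brute-force count: independent sets that dominate all other vertices."""
    nodes = sorted(adj)
    count = 0
    for k in range(len(nodes) + 1):
        for subset in combinations(nodes, k):
            chosen = set(subset)
            # Independent: no chosen vertex has a chosen neighbour.
            independent = all(adj[v].isdisjoint(chosen) for v in chosen)
            # Maximal: every unchosen vertex has a chosen neighbour.
            maximal = all(adj[v] & chosen for v in nodes if v not in chosen)
            if independent and maximal:
                count += 1
    return count


def disjoint_triangles(k):
    """Adjacency map of k vertex-disjoint triangles on 3k vertices."""
    adj = {}
    for t in range(k):
        a, b, c = 3 * t, 3 * t + 1, 3 * t + 2
        adj[a], adj[b], adj[c] = {b, c}, {a, c}, {a, b}
    return adj


for k in (1, 2, 3):
    n = 3 * k
    print(n, count_maximal_independent_sets(disjoint_triangles(k)), 3 ** k)
```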
Subsumption and Complementation as Data Fusion Operators
Cited by 1 (1 self)
The goal of data fusion is to combine several representations of one real-world object into a single, consistent representation, e.g., in data integration. A very popular operator to perform data fusion is the minimum union operator. It is defined as the outer union and the subsequent removal of subsumed tuples. Minimum union is used in other applications as well, for instance in database query optimization to rewrite outer join queries, in the semantic web community in implementing SPARQL's OPTIONAL operator, etc. Despite its wide applicability, there are only a few efficient implementations, and until now, minimum union has not been a relational database primitive. This paper fills this gap as we present implementations of subsumption that serve as a building block for minimum union. Furthermore, we consider this operator as a database primitive and show how to perform optimization of query plans in the presence of subsumption and minimum union through rule-based plan transformations. Experiments on both artificial and real-world data show that our algorithms outperform existing algorithms used for subsumption in terms of runtime and that they scale to large volumes of data. In the context of data integration, we observe that performing data fusion calls for more than subsumption and minimum union. Therefore, another contribution of this paper is the definition of the complementation and complement union operators. Intuitively, these allow merging tuples that have complementing values and thus eliminate unnecessary null values. Research was partially performed while at Hasso-Plattner-Institut.
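The definition of minimum union given in this abstract (outer union followed by removal of subsumed tuples) can be sketched naively as follows. Tuples are modelled as Python dicts with `None` standing for null. The subsumption test used here (same attributes, agreement on every non-null value of the subsumed tuple, strictly fewer nulls) and the removal of exact duplicates in the outer union are assumptions of this illustration. The quadratic comparison loop is deliberately simple and is not one of the efficient implementations the paper proposes.

```python
def subsumes(t1, t2):
    """True if t1 subsumes t2 under the assumed definition in the lead-in."""
    if t1.keys() != t2.keys():
        return False
    agrees = all(t1[a] == v for a, v in t2.items() if v is not None)
    nulls1 = sum(v is None for v in t1.values())
    nulls2 = sum(v is None for v in t2.values())
    return agrees and nulls2 > nulls1


def outer_union(r1, r2):
    """Union over the combined schema, padding missing attributes with None."""
    attrs = sorted({a for t in r1 + r2 for a in t})
    rows = []
    for t in r1 + r2:
        padded = {a: t.get(a) for a in attrs}
        if padded not in rows:  # drop exact duplicates (assumed set semantics)
            rows.append(padded)
    return rows


def minimum_union(r1, r2):
    """Outer union with all subsumed tuples removed (naive quadratic check)."""
    rows = outer_union(r1, r2)
    return [t for t in rows if not any(subsumes(u, t) for u in rows)]


# Hypothetical relations describing the same people from two sources.
r1 = [{"name": "Ann", "city": "Berlin"}, {"name": "Bob", "city": None}]
r2 = [{"name": "Ann", "phone": "123"}, {"name": "Bob", "phone": "456"}]
for row in minimum_union(r1, r2):
    print(row)
```

In this example the all-null Bob tuple is subsumed and dropped, while the two Ann tuples survive because neither subsumes the other. Tuples of that kind, which agree where both are non-null and fill each other's nulls, are the ones the complementation operator in the abstract is meant to merge.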

