Results 11  20
of
650
Discriminative frequent pattern analysis for effective classification
 In ICDE
, 2007
"... The application of frequent patterns in classification appeared in sporadic studies and achieved initial success in the classification of relational data, text documents and graphs. In this paper, we conduct a systematic exploration of frequent patternbased classification, and provide solid reasons ..."
Abstract

Cited by 112 (20 self)
 Add to MetaCart
(Show Context)
The application of frequent patterns in classification appeared in sporadic studies and achieved initial success in the classification of relational data, text documents and graphs. In this paper, we conduct a systematic exploration of frequent patternbased classification, and provide solid reasons supporting this methodology. It was well known that feature combinations (patterns) could capture more underlying semantics than single features. However, inclusion of infrequent patterns may not significantly improve the accuracy due to their limited predictive power. By building a connection between pattern frequency and discriminative measures such as information gain and Fisher score, we develop a strategy to set minimum support in frequent pattern mining for generating useful patterns. Based on this strategy, coupled with a proposed feature selection algorithm, discriminative frequent patterns can be generated for building high quality classifiers. We demonstrate that the frequent patternbased classification framework can achieve good scalability and high accuracy in classifying large datasets. Empirical studies indicate that significant improvement in classification accuracy is achieved (up to 12 % in UCI datasets) using the soselected discriminative frequent patterns. 1.
Patterns of influence in a recommendation network,”
 Proc. 10th PacificAsia Conf. on Advances in Knowledge Discovery and Data Mining (PAKDD),
, 2006
"... Abstract. Information cascades are phenomena whereby individuals adopt a new action or idea due to influence by others. As such a process spreads through an underlying social network, it can result in widespread adoption overall. We consider information cascades in the context of recommendations, a ..."
Abstract

Cited by 105 (14 self)
 Add to MetaCart
(Show Context)
Abstract. Information cascades are phenomena whereby individuals adopt a new action or idea due to influence by others. As such a process spreads through an underlying social network, it can result in widespread adoption overall. We consider information cascades in the context of recommendations, and in particular study the patterns of cascading recommendations that arise in large social networks. We investigate a large persontoperson recommendation network, consisting of four million people who made sixteen million recommendations on half a million products. Such a dataset allows to pose a number of fundamental questions: What cascades arise frequently in real life? What features distinguish them? We enumerate and count cascade subgraphs on large directed graphs; as one component of this, we develop a novel efficient heuristic based on graph isomorphism testing that scales to large datasets. We discover novel patterns: the distribution of cascade sizes and depths follows a power law. Generally, cascades tend to be shallow, but occasional large bursts of propagation can occur. Cascade subgraphs are mainly treelike, but we observe variability in connectivity and branching across recommendations for different types of products.
State of the Art of Graphbased Data Mining
, 2003
"... The need for mining structured data has increased in the past few years. One of the best studied data structures in computer science and discrete mathematics are graphs. It can therefore be no surprise that graph based data mining has become quite popular in the last few years. This article introduc ..."
Abstract

Cited by 105 (0 self)
 Add to MetaCart
The need for mining structured data has increased in the past few years. One of the best studied data structures in computer science and discrete mathematics are graphs. It can therefore be no surprise that graph based data mining has become quite popular in the last few years. This article introduces the theoretical basis of graph based data mining and surveys the state of the art of graphbased data mining. Brief descriptions of some representative approaches are provided as well.
Graph Kernels
, 2007
"... We present a unified framework to study graph kernels, special cases of which include the random walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time complexit ..."
Abstract

Cited by 101 (9 self)
 Add to MetaCart
We present a unified framework to study graph kernels, special cases of which include the random walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time complexity of kernel computation between unlabeled graphs with n vertices from O(n 6) to O(n 3). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixedpoint methods that take O(dn 3) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for ddimensional edge kernels, and O(n 4) in the infinitedimensional case; on sparse graphs these algorithms only take O(n 2) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to Rconvolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment kernel of Fröhlich et al. (2006) yet provably positive semidefinite.
Spin: Mining maximal frequent subgraphs from graph databases
 IN KDD
, 2004
"... One fundamental challenge for mining recurring subgraphs from semistructured data sets is the overwhelming abundance of such patterns. In large graph databases, the total number of frequent subgraphs can become too large to allow a full enumeration using reasonable computational resources. In this ..."
Abstract

Cited by 99 (12 self)
 Add to MetaCart
(Show Context)
One fundamental challenge for mining recurring subgraphs from semistructured data sets is the overwhelming abundance of such patterns. In large graph databases, the total number of frequent subgraphs can become too large to allow a full enumeration using reasonable computational resources. In this paper, we propose a new algorithm that mines only maximal frequent subgraphs, i.e. subgraphs that are not a part of any other frequent subgraphs. This may exponentially decrease the size of the output set in the best case; in our experiments on practical data sets, mining maximal frequent subgraphs reduces the total number of mined patterns by two to three orders of magnitude. Our method first mines all frequent trees from a general graph database and then reconstructs all maximal subgraphs from the mined trees. Using two chemical structure benchmarks and a set of synthetic graph data sets, we demonstrate that, in addition to decreasing the output size, our algorithm can achieve a fivefold speed up over the current stateoftheart subgraph mining algorithms.
Link Mining: A Survey
 SigKDD Explorations Special Issue on Link Mining
, 2005
"... Many datasets of interest today are best described as a linked collection of interrelated objects. These may represent homogeneous networks, in which there is a singleobject type and link type, or richer, heterogeneous networks, in which there may be multiple object and link types (and possibly oth ..."
Abstract

Cited by 84 (0 self)
 Add to MetaCart
(Show Context)
Many datasets of interest today are best described as a linked collection of interrelated objects. These may represent homogeneous networks, in which there is a singleobject type and link type, or richer, heterogeneous networks, in which there may be multiple object and link types (and possibly other semantic information). Examples of homogeneous networks include single mode social networks, such as people connected by friendship links, or the WWW, a collection of linked web pages. Examples of heterogeneous networks include those in medical domains describing patients, diseases, treatments and contacts, or in bibliographic domains describing publications, authors, and venues. Link mining refers to data mining techniques that explicitly consider these links when building predictive or descriptive models of the linked data. Commonly addressed link mining tasks include object ranking, group detection, collective classification, link prediction and subgraph discovery. While network analysis has been studied in depth in particular areas such as social network analysis, hypertext mining, and web analysis, only recently has there been a crossfertilization of ideas among these different communities. This is an exciting, rapidly expanding area. In this article, we review some of the common emerging themes. 1.
Efficient Aggregation for Graph Summarization
"... Graphs are widely used to model real world objects and their relationships, and large graph datasets are common in many application domains. To understand the underlying characteristics of large graphs, graph summarization techniques are critical. However, existing graph summarization methods are mo ..."
Abstract

Cited by 83 (5 self)
 Add to MetaCart
(Show Context)
Graphs are widely used to model real world objects and their relationships, and large graph datasets are common in many application domains. To understand the underlying characteristics of large graphs, graph summarization techniques are critical. However, existing graph summarization methods are mostly statistical (studying statistics such as degree distributions, hopplots and clustering coefficients). These statistical methods are very useful, but the resolutions of the summaries are hard to control. In this paper, we introduce two databasestyle operations to summarize graphs. Like the OLAPstyle aggregation methods that allow users to drilldown or rollup to control the resolution of summarization, our methods provide an analogous functionality for large graph datasets. The first operation, called SNAP, produces a summary graph by grouping nodes based on userselected node attributes and relationships. The second operation, called kSNAP, further allows users to control the resolutions of summaries and provides the “drilldown ” and “rollup ” abilities to navigate through summaries with different resolutions. We propose an efficient algorithm to evaluate the SNAP operation. In addition, we prove that the kSNAP computation is NPcomplete. We propose two heuristic methods to approximate the kSNAP results. Through extensive experiments on a variety of real and synthetic datasets, we demonstrate the effectiveness and efficiency of the proposed methods.
OddBall: Spotting Anomalies in Weighted Graphs
"... Abstract. Given a large, weighted graph, how can we find anomalies? Which rules should be violated, before we label a node as an anomaly? We propose the OddBall algorithm, to find such nodes. The contributions are the following: (a) we discover several new rules (power laws) in density, weights, ran ..."
Abstract

Cited by 77 (28 self)
 Add to MetaCart
(Show Context)
Abstract. Given a large, weighted graph, how can we find anomalies? Which rules should be violated, before we label a node as an anomaly? We propose the OddBall algorithm, to find such nodes. The contributions are the following: (a) we discover several new rules (power laws) in density, weights, ranks and eigenvalues that seem to govern the socalled “neighborhood subgraphs ” and we show how to use these rules for anomaly detection; (b) we carefully choose features, and design OddBall, so that it is scalable and it can work unsupervised (no userdefined constants) and (c) we report experiments on many real graphs with up to 1.6 million nodes, where OddBall indeed spots unusual nodes that agree with intuition. 1
Fgindex: towards verificationfree query processing on graph databases
 in SIGMOD, 2007
"... Graphs are prevalently used to model the relationships between objects in various domains. With the increasing usage of graph databases, it has become more and more demanding to efficiently process graph queries. Querying graph databases is costly since it involves subgraph isomorphism testing, whic ..."
Abstract

Cited by 77 (10 self)
 Add to MetaCart
(Show Context)
Graphs are prevalently used to model the relationships between objects in various domains. With the increasing usage of graph databases, it has become more and more demanding to efficiently process graph queries. Querying graph databases is costly since it involves subgraph isomorphism testing, which is an NPcomplete problem. In recent years, some effective graph indexes have been proposed to first obtain a candidate answer set by filtering part of the false results and then perform verification on each candidate by checking subgraph isomorphism. Query performance is improved since the number of subgraph isomorphism tests is reduced. However, candidate verification is still inevitable, which can be expensive when the size of the candidate answer set is large. In this paper, we propose a novel indexing technique that constructs a nested invertedindex, called FGindex, based on the set of Frequent subGraphs (FGs). Given a graph query that is an FG in the database, FGindex returns the exact set of query answers without performing candidate verification. When the query is an infrequent graph, FGindex produces a candidate answer set which is close to the exact answer set. Since an infrequent graph means the graph occurs in only a small number of graphs in the database, the number of subgraph isomorphism tests is small. To ensure that the index fits into the main memory, we propose a new notion of δTolerance Closed Frequent Graphs (δTCFGs), which allows us to flexibly tune the size of the index in a parameterized way. Our extensive experiments verify that query processing using FGindex is orders of magnitude more efficient than using the stateoftheart graph index.
NetProbe: A fast and scalable system for fraud detection in online auction networks
 In Proceedings of the 16th international Conference on the World Wide Web
, 2007
"... Given a large online network of online auction users and their histories of transactions, how can we spot anomalies and auction fraud? This paper describes the design and implementation of NetProbe, a system that we propose for solving this problem. NetProbe models auction users and transactions as ..."
Abstract

Cited by 73 (27 self)
 Add to MetaCart
(Show Context)
Given a large online network of online auction users and their histories of transactions, how can we spot anomalies and auction fraud? This paper describes the design and implementation of NetProbe, a system that we propose for solving this problem. NetProbe models auction users and transactions as a Markov Random Field tuned to detect the suspicious patterns that fraudsters create, and employs a Belief Propagation mechanism to detect likely fraudsters. Our experiments show that NetProbe is both efficient and effective for fraud detection. We report experiments on synthetic graphs with as many as 7,000 nodes and 30,000 edges, where NetProbe was able to spot fraudulent nodes with over 90 % precision and recall, within a matter of seconds. We also report experiments on a real dataset crawled from eBay,