Efficient Processing of Exact Top-k Queries over Sorted Lists
2010
Cited by 1 (0 self)
The top-k query is employed in a wide range of applications to generate a ranked list of the data that have the highest aggregate scores over certain attributes. Because the pool of attributes available for selection by individual queries may be large, the data are indexed with per-attribute sorted lists, and a threshold algorithm is applied to the lists involved in each query. The threshold algorithm executes in two phases: find a cutoff threshold for the top-k result scores, then evaluate all the records that could score above the threshold. In this paper, we focus on exact top-k queries that involve monotonic linear scoring functions over disk-resident sorted lists. We introduce a model for estimating the depths to which each sorted list needs to be processed
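As a rough illustration of how a threshold algorithm works over per-attribute sorted lists, the sketch below implements the classic in-memory form: scan all lists in lockstep, score each newly seen record by random access, and stop once the k-th best score reaches the threshold formed from the current scan positions. It is not the paper's two-phase, disk-resident variant or its depth-estimation model. The function name `threshold_topk`, the in-memory list layout, and the assumption that every record appears in every list are all hypothetical choices made for this example.

```python
import heapq

def threshold_topk(lists, weights, k):
    """Classic threshold-algorithm sketch (assumed in-memory layout).

    lists[i]   -- [(record_id, score), ...] for attribute i, sorted by score
                  descending; every record is assumed to appear in every list.
    weights[i] -- non-negative weight of attribute i (monotone linear scoring).
    Returns (top-k as [(aggregate, record_id)] best first, depth scanned).
    """
    # Random-access tables: record_id -> score, one per attribute.
    lookup = [dict(lst) for lst in lists]
    top = []          # min-heap holding the best k aggregates seen so far
    seen = set()
    depth_used = 0
    for depth in range(len(lists[0])):
        depth_used = depth + 1
        # Sorted access: read one entry from each list at this depth.
        for lst in lists:
            rid = lst[depth][0]
            if rid in seen:
                continue
            seen.add(rid)
            total = sum(w * table[rid] for w, table in zip(weights, lookup))
            if len(top) < k:
                heapq.heappush(top, (total, rid))
            elif total > top[0][0]:
                heapq.heapreplace(top, (total, rid))
        # No unseen record can score above this bound.
        threshold = sum(w * lst[depth][1] for w, lst in zip(weights, lists))
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), depth_used

attr_a = [("r1", 9), ("r2", 8), ("r3", 3), ("r4", 1)]
attr_b = [("r2", 9), ("r1", 7), ("r4", 5), ("r3", 2)]
result, depth = threshold_topk([attr_a, attr_b], [1, 2], k=2)
print(result, depth)
```

In this toy input the scan stops after two entries per list, because the second-best aggregate (23) already meets the threshold at that depth (22); the remaining entries are never read.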
Outlier Detection for Information Networks
The study of networks has emerged in diverse disciplines as a means of analyzing complex relationship data. A significant amount of work in network science studies properties of networks, querying over networks, link analysis, influence propagation, network optimization, and many other forms of network analysis. Only recently has there been some work on outlier detection for information network data. Outlier (or anomaly) detection is a very broad field and has been studied in the context of a large number of application domains. Many algorithms have been proposed for outlier detection in high-dimensional data, uncertain data, stream data, and time series data. By its inherent nature, network data poses very different challenges that need to be addressed in a special way. Network data is massive; it contains nodes of different types, rich nodes with associated attribute data, noisy attribute data, and noisy link data; and it evolves dynamically in multiple ways. This thesis focuses on outlier detection for such networks from two perspectives: (1) community-based outliers and (2) query-based outliers. For community-based outliers, we discuss the problem in both static and dynamic settings.
Top-K Interesting Subgraph Discovery in Information Networks
In the real world, various systems can be modeled using heterogeneous networks, which consist of entities of different types. Many problems on such networks can be mapped to an underlying critical problem of discovering top-K subgraphs of entities with rare and surprising associations. Answering such subgraph queries efficiently involves two main challenges: (1) computing all matching subgraphs that satisfy the query and (2) ranking such results based on the rarity and the interestingness of the associations among entities in the subgraphs. Previous work on the matching problem can be harnessed for a naive ranking-after-matching solution. However, for large graphs, subgraph queries may have an enormous number of matches, so it is inefficient to compute all matches when only the top-K matches are desired. In this paper, we address the two challenges of matching and ranking in top-K subgraph discovery as follows. First, we introduce two index structures for the network, a topology index and a graph maximum metapath weight index, both of which are computed offline. Second, we propose novel top-K mechanisms that exploit these indexes to answer interesting subgraph queries online efficiently. Experimental results on several synthetic datasets and on the DBLP and Wikipedia datasets containing thousands of entities show the efficiency and the effectiveness of the proposed approach in computing interesting subgraphs.
Optimal Enumeration: Efficient Top-k Tree Matching
Driven by many real applications, graph pattern matching has attracted a great deal of attention recently. A twig-pattern matching may produce an extremely large number of matches in a graph, which may not only confuse users with too many results but also incur high computational costs. In this paper, we study the problem of top-k tree pattern matching; that is, given a rooted tree T, compute its top-k matches in a directed graph G based on the twig-pattern matching semantics. First, we present a novel and optimal enumeration paradigm based on the principle of Lawler's procedure. We show that our enumeration algorithm runs in O(n_T + log k) time in each round, where n_T is the number of nodes in T. Considering that the time complexity to output a match of T is O(n_T) and that n_T ≥ log k in practice, our enumeration technique is optimal. Moreover, the cost of generating the top-1 match of T in our algorithm is O(m_R), where m_R is the number of edges in the transitive closure of a data graph G involving all nodes relevant to T. O(m_R) is also optimal in the worst case without pre-knowledge of G. Consequently, our algorithm is optimal with a running time of O(m_R + k(n_T + log k)), in contrast to the time complexity O(m_R log k + k n_T (log k + d_T)) of the existing technique, where d_T is the maximal node degree in T. Second, a novel priority-based access technique is proposed, which greatly reduces the number of edges accessed and results in a significant performance improvement. Finally, we apply our techniques to the general form of the top-k graph pattern matching problem (i.e., the query is a graph) to improve the existing techniques. Comprehensive empirical studies demonstrate that our techniques may improve on the existing techniques by orders of magnitude.
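To give a feel for Lawler-style enumeration, the sketch below applies the partitioning idea to a much simpler toy problem than tree matching: pick one candidate cost per slot and list the k cheapest combinations in order. Each round pops the best remaining solution from a heap and splits the rest of its subspace into disjoint children, so no combination is generated twice. The slot model, the function name `k_best_combinations`, and the sample costs are assumptions made for this illustration; this is not the paper's algorithm and does not reach its O(n_T + log k) per-round bound, since each child here is built by copying a tuple.

```python
import heapq

def k_best_combinations(slots, k):
    """Enumerate the k cheapest ways to pick one cost from each slot.

    slots[i] -- non-empty list of candidate costs for slot i, ascending.
    Returns [(total_cost, chosen_indices)] in non-decreasing total cost.
    """
    n = len(slots)
    start = (0,) * n
    # Heap entries: (total, chosen indices, first slot still free to advance).
    heap = [(sum(s[0] for s in slots), start, 0)]
    results = []
    while heap and len(results) < k:
        total, choice, first_free = heapq.heappop(heap)
        results.append((total, choice))
        # Lawler-style partition: child j freezes slots before j and moves
        # slot j to its next candidate, so the children's subspaces are disjoint.
        for j in range(first_free, n):
            nxt = choice[j] + 1
            if nxt < len(slots[j]):
                child = choice[:j] + (nxt,) + choice[j + 1:]
                child_total = total - slots[j][choice[j]] + slots[j][nxt]
                heapq.heappush(heap, (child_total, child, j))
    return results

slots = [[1, 4], [2, 3, 10], [0, 5]]
best = k_best_combinations(slots, 4)
print(best)
```

Because costs within a slot are ascending, every child costs at least as much as its parent, which is what lets the heap emit solutions in sorted order without generating the full set of combinations first.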