Measuring data abstraction quality in multiresolution visualization
 IEEE InfoVis
, 2006
Abstract

Cited by 26 (5 self)
Data abstraction techniques are widely used in multiresolution visualization systems to reduce visual clutter and facilitate analysis from overview to detail. However, analysts are usually unaware of how well the abstracted data represent the original dataset, which can impact the reliability of results gleaned from the abstractions. In this thesis, we define three types of data abstraction quality measures for computing the degree to which the abstraction conveys the original dataset: the Histogram Difference Measure, the Nearest Neighbor Measure, and the Statistical Measure. They have been integrated within XmdvTool, a public-domain multiresolution visualization system for multivariate data analysis that supports both sampling and clustering to simplify data. Several interactive operations are provided, including adjusting the data abstraction level, changing selected regions, and setting the acceptable data abstraction quality level. By conducting these operations, analysts can select an optimal data abstraction level.
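As an illustration of the first of these measures, a histogram-based quality score can be sketched as below. This is an interpretation, not the thesis's exact formula: the function name, the bin count, and the use of total variation distance as the comparison are assumptions.

```python
import numpy as np

def histogram_difference_measure(original, abstracted, bins=16):
    # Score in [0, 1]: 1.0 means the abstraction's binned distribution
    # matches the original's exactly in every dimension.
    original = np.asarray(original, dtype=float)
    abstracted = np.asarray(abstracted, dtype=float)
    scores = []
    for d in range(original.shape[1]):
        lo = min(original[:, d].min(), abstracted[:, d].min())
        hi = max(original[:, d].max(), abstracted[:, d].max())
        h_orig, _ = np.histogram(original[:, d], bins=bins, range=(lo, hi))
        h_abst, _ = np.histogram(abstracted[:, d], bins=bins, range=(lo, hi))
        p = h_orig / h_orig.sum()
        q = h_abst / h_abst.sum()
        # 1 minus total variation distance between the two histograms
        scores.append(1.0 - 0.5 * np.abs(p - q).sum())
    return float(np.mean(scores))
```

A random sample of a dataset should score high but below a perfect 1.0; the Nearest Neighbor Measure could be sketched analogously with point-to-point distances.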
Scalable K-Means++
Abstract

Cited by 24 (2 self)
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
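For context, the sequential k-means++ seeding that k-means|| parallelizes can be sketched as follows (an illustrative implementation, not the paper's code):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    # Sequential D^2 seeding: the first center is uniform, each later
    # center is drawn with probability proportional to the squared
    # distance to the nearest center chosen so far (one pass per center).
    rng = np.random.default_rng(rng)
    n = len(X)
    centers = [X[rng.integers(n)]]
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```

The loop makes k sequential passes over X, which is the bottleneck the paper removes by oversampling many candidate centers per round.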
Data Mining Methods for Network Intrusion Detection
, 2004
Abstract

Cited by 22 (1 self)
Network intrusion detection systems have become a standard component in security infrastructures. Unfortunately, current systems are poor at detecting novel attacks without an unacceptable level of false alarms. We propose that the solution to this problem is the application of an ensemble of data mining techniques that can be applied to network connection data in an offline environment, augmenting existing real-time sensors. In this paper, we expand on our motivation, particularly with regard to running in an offline environment, and our interest in multi-sensor and multi-method correlation. We then review existing systems, from commercial systems to research-based intrusion detection systems. Next we survey the state of the art in the area. Standard datasets and feature extraction turned out to be more important than we had initially anticipated, so each can be found under its own heading. Next, we review the actual data mining methods that have been proposed or implemented. We conclude by summarizing the open problems in this area, along with some questions of a broader scope. We hope that by providing the motivation and summarizing the work in this area we can stimulate further research.
k-means has polynomial smoothed complexity
 In Proc. of the 50th FOCS (Atlanta, USA)
, 2009
Abstract

Cited by 21 (3 self)
The k-means method is one of the most widely used clustering algorithms, drawing its popularity from its speed in practice. Recently, however, it was shown to have exponential worst-case running time. In order to close the gap between practical performance and theoretical analysis, the k-means method has been studied in the model of smoothed analysis. But even the smoothed analyses so far are unsatisfactory, as the bounds are still super-polynomial in the number n of data points. In this paper, we settle the smoothed running time of the k-means method. We show that the smoothed number of iterations is bounded by a polynomial in n and 1/σ, where σ is the standard deviation of the Gaussian perturbations. This means that if an arbitrary input data set is randomly perturbed, then the k-means method will run in expected polynomial time on that input set.
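The iteration count that this smoothed bound concerns is that of the standard Lloyd alternation, sketched below (illustrative, not the paper's implementation; the empty-cluster handling is an assumption):

```python
import numpy as np

def lloyd_kmeans(X, centers, max_iter=100):
    # Assign each point to its nearest center, recompute centers as
    # cluster means, and stop once the assignment no longer changes.
    # Returns the final centers and the number of iterations performed.
    centers = np.array(centers, dtype=float)
    labels = None
    for it in range(1, max_iter + 1):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            return centers, it
        labels = new_labels
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):              # leave empty clusters in place
                centers[j] = members.mean(axis=0)
    return centers, max_iter
```

On well-separated data the loop stabilizes in a handful of iterations, which is the practical behavior the smoothed analysis explains.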
Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method
 In Proc. of the 47th Ann. IEEE Symp. on Foundations of Computer Science (FOCS)
, 2006
Abstract

Cited by 20 (3 self)
We show a worst-case lower bound and a smoothed upper bound on the number of iterations performed by the Iterative Closest Point (ICP) algorithm. First proposed by Besl and McKay, the algorithm is widely used in computational geometry, where it is known for its simplicity and its observed speed. The theoretical study of ICP was initiated by Ezra, Sharir and Efrat, who bounded its worst-case running time between Ω(n log n) and O(n^2 d)^d. We substantially tighten this gap by improving the lower bound to Ω(n/d)^(d+1). To help reconcile this bound with the algorithm's observed speed, we also show that the smoothed complexity of ICP is polynomial, independent of the dimensionality of the data. Using similar methods, we improve the best known smoothed upper bound for the popular k-means method to n^O(k), once again independent of the dimension.
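A minimal sketch of the ICP alternation whose iteration count is being bounded, simplified to translations only (real ICP also fits a rotation; the function name and stopping rule are assumptions):

```python
import numpy as np

def icp_translation(A, B, max_iter=50, tol=1e-9):
    # Alternate two steps: (1) match every point of the moving set A to
    # its nearest neighbor in B, (2) translate A by the mean residual of
    # those matches. Stop when the update becomes negligible.
    t = np.zeros(A.shape[1])
    for it in range(1, max_iter + 1):
        moved = A + t
        d2 = ((moved[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        matches = B[d2.argmin(axis=1)]          # nearest neighbor in B
        step = (matches - moved).mean(axis=0)   # optimal translation update
        t += step
        if np.linalg.norm(step) < tol:
            return t, it
    return t, max_iter
```

Each pass through the loop is one ICP iteration; the paper's bounds concern how many such iterations can occur before the matching stabilizes.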
A Survey of Algorithms for Dense Subgraph Discovery
Abstract

Cited by 20 (1 self)
In this chapter, we present a survey of algorithms for dense subgraph discovery. The problem of dense subgraph discovery is closely related to clustering, though the two problems also have a number of differences. For example, the problem of clustering is largely concerned with finding a fixed partition of the data, whereas the problem of dense subgraph discovery defines dense components in a much more flexible way. The problem of dense subgraph discovery may either be defined over single or multiple graphs. We explore both cases. In the latter case, the problem is also closely related to the problem of frequent subgraph discovery. This chapter discusses and organizes the literature on this topic in order to make it more accessible to the reader.
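One representative algorithm from this literature is greedy peeling, which repeatedly removes a minimum-degree vertex and keeps the densest intermediate subgraph; it is a 2-approximation for maximizing density |E|/|V|. Whether the chapter presents it this way is an assumption; the sketch below is illustrative:

```python
def densest_subgraph(adj):
    # Greedy peeling: repeatedly delete a minimum-degree vertex and
    # remember the intermediate subgraph maximizing density |E|/|V|.
    # adj maps each vertex to the set of its neighbors.
    adj = {v: set(ns) for v, ns in adj.items()}
    m = sum(len(ns) for ns in adj.values()) // 2
    best_density, best_set = m / len(adj), set(adj)
    while len(adj) > 1:
        v = min(adj, key=lambda u: len(adj[u]))   # minimum-degree vertex
        m -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
        density = m / len(adj)
        if density > best_density:
            best_density, best_set = density, set(adj)
    return best_set, best_density
```

Unlike a clustering partition, the returned vertex set is a single flexible dense component, which is the contrast the abstract draws.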
Fast Agglomerative Clustering for Rendering
Abstract

Cited by 18 (9 self)
Hierarchical representations of large data sets, such as binary cluster trees, are a crucial component in many scalable algorithms used in various fields. Two major approaches for building these trees are agglomerative, or bottom-up, clustering and divisive, or top-down, clustering. The agglomerative approach offers some real advantages, such as more flexible clustering, and often produces higher-quality trees, but has been little used in graphics because it is frequently assumed to be prohibitively expensive (O(N²) or worse). In this paper we show that agglomerative clustering can be done efficiently even for very large data sets. We introduce a novel locally-ordered algorithm that is faster than traditional heap-based agglomerative clustering, and show that the complexity of the tree build time is much closer to linear than quadratic. We also evaluate the quality of the agglomerative clustering trees compared to the best known divisive clustering strategies in two sample applications: bounding volume hierarchies for ray tracing and light trees in the Lightcuts rendering algorithm. Tree quality is highly application-, data set-, and dissimilarity function-specific. In our experiments the agglomeratively built tree quality is consistently higher, by margins ranging from slight to significant, with up to a 35% reduction in tree query times.
A phonological expression for physical movement monitoring in body sensor networks
 2008 5th IEEE International Conference on Mobile Ad Hoc and Sensor Systems
, 2008
Abstract

Cited by 17 (10 self)
Monitoring human activities using wearable wireless sensor nodes has the potential to enable many useful applications for everyday situations. The deployment of a compact and computationally efficient grammatical representation of actions reduces the complexities involved in the detection and recognition of human behaviors in a distributed system. In this paper, we introduce a road map to a linguistic framework for the symbolic representation of inertial information for physical movement monitoring. Our method for creating phonetic descriptions consists of constructing primitives across the network and assigning certain primitives to each movement. Our technique exploits the notion of a decision tree to identify atomic actions corresponding to every given movement. We pose an optimization problem for the fast identification of primitives. We then prove that this problem is NP-complete and provide a fast greedy algorithm to approximate the solution. Finally, we demonstrate the effectiveness of our phonetic model on data collected from three subjects.
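The paper's own greedy algorithm is not reproduced here. As a flavor of greedy approximation for an NP-complete covering-style selection problem, classic greedy set cover can be sketched; the mapping to the paper's primitive-selection objective is an assumption:

```python
def greedy_set_cover(universe, subsets):
    # At each step pick the named subset that covers the most
    # still-uncovered elements; the classic ln(n)-approximation.
    uncovered = set(universe)
    chosen = []
    while uncovered:
        name, members = max(subsets.items(),
                            key=lambda kv: len(kv[1] & uncovered))
        if not members & uncovered:
            raise ValueError("universe is not coverable by the subsets")
        chosen.append(name)
        uncovered -= members
    return chosen
```

Here one could read the universe as the set of movements to distinguish and each subset as the movements a candidate primitive helps discriminate, though that reading is purely illustrative.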