Results 1-10 of 19
Outlier Ensembles [Position Paper]
Cited by 6 (0 self)
Ensemble analysis is a widely used meta-algorithm for many data mining problems such as classification and clustering. Numerous ensemble-based algorithms have been proposed in the literature for these problems. Compared to the clustering and classification problems, ensemble analysis has been studied in a limited way in the outlier detection literature. In some cases, ensemble analysis techniques have been implicitly used by many outlier analysis algorithms, but the approach is often buried deep into the algorithm and not formally recognized as a general-purpose meta-algorithm. This is in spite of the fact that this problem is rather important in the context of outlier analysis. This paper discusses the various methods which are used in the literature for outlier ensembles and the general principles by which such analysis can be made more effective. A discussion is also provided on how outlier ensembles relate to the ensemble techniques used commonly for other data mining problems.
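As a rough illustration of what an outlier ensemble can look like (a minimal sketch, not a method taken from the paper): several base detectors each score every point, the scores are put on a common scale, and the results are averaged. The base detector (distance to the k-th nearest neighbor), the values in `ks`, and all function names are assumptions made for this example.

```python
import math
import statistics

def euclid(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def knn_scores(points, k):
    """Base detector (assumed): score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(euclid(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def standardize(scores):
    """Convert raw scores to z-scores so that different detectors are comparable."""
    mu = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mu) / sd if sd > 0 else 0.0 for s in scores]

def ensemble_scores(points, ks=(1, 2, 3)):
    """Average the standardized scores of one k-NN detector per value of k."""
    per_detector = [standardize(knn_scores(points, k)) for k in ks]
    return [statistics.fmean(column) for column in zip(*per_detector)]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = ensemble_scores(points)
most_outlying = max(range(len(points)), key=lambda i: scores[i])
```

Averaging is only one possible combination function; taking the maximum of the standardized scores is another common choice.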
Evolutionary Network Analysis: A Survey
Cited by 6 (1 self)
Evolutionary network analysis has found an increasing interest in the literature because of the importance of different kinds of dynamic social networks, email networks, biological networks, and social streams. When a network evolves, the results of data mining algorithms such as community detection need to be correspondingly updated. Furthermore, the specific kinds of changes to the structure of the network, such as the impact on community structure or the impact on network structural parameters, such as node degrees, also need to be analyzed. Some dynamic networks have a much faster rate of edge arrival and are referred to as network streams or graph streams. The analysis of such networks is especially challenging, because it needs to be performed with an online approach, under the one-pass constraint of data streams. The incorporation of content can add further complexity to the evolution analysis process. This survey provides an overview of the vast literature on graph evolution analysis and the numerous applications that arise in different contexts.
Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy
 In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15), 2015
Cited by 1 (1 self)
Clustering time series is a useful operation in its own right, and an important subroutine in many higher-level data mining analyses, including data editing for classifiers, summarization, and outlier detection. While it has been noted that the general superiority of Dynamic Time Warping (DTW) over Euclidean Distance for similarity search diminishes as we consider ever larger datasets, as we shall show, the same is not true for clustering. Thus, clustering time series under DTW remains a computationally challenging task. In this work, we address this lethargy in two ways. We propose a novel pruning strategy that exploits both upper and lower bounds to prune off a large fraction of the expensive distance calculations. This pruning strategy is admissible, giving us provably identical results to the brute force algorithm, but is at least an order of magnitude faster. For datasets where even this level of speedup is inadequate, we show that we can use a simple heuristic to order the unavoidable calculations in a most-useful-first ordering, thus casting the clustering as an anytime algorithm. We demonstrate the utility of our ideas with both single and multidimensional case studies in the domains of astronomy, speech physiology, medicine and entomology.
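The general idea of admissible pruning with upper and lower bounds can be sketched as follows. This is not the paper's algorithm: the particular bounds (first/last-point lower bound, Euclidean upper bound), the threshold query `within`, and all names are assumptions chosen to keep the example small. The point is that a cheap bound can settle the question "is DTW(a, b) at most r?" without running the quadratic DTW computation, and the answer is always the same as the brute-force one.

```python
import math

def dtw(a, b):
    """Unconstrained DTW distance with squared point costs; O(len(a) * len(b)) time."""
    inf = math.inf
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return math.sqrt(prev[len(b)])

def lower_bound(a, b):
    """Every warping path contains the first and the last cell, so their cost cannot exceed DTW."""
    if len(a) == 1 and len(b) == 1:
        return abs(a[0] - b[0])
    return math.sqrt((a[0] - b[0]) ** 2 + (a[-1] - b[-1]) ** 2)

def upper_bound(a, b):
    """The diagonal is a valid warping path, so Euclidean distance is at least DTW (equal lengths only)."""
    if len(a) != len(b):
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def within(a, b, r):
    """Return (DTW(a, b) <= r, how the answer was obtained)."""
    if lower_bound(a, b) > r:
        return False, "pruned by lower bound"
    if upper_bound(a, b) <= r:
        return True, "pruned by upper bound"
    return dtw(a, b) <= r, "full DTW"

a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]
print(within(a, b, 3.0))        # settled by the upper bound
print(within(a, b, 1.0))        # bounds are inconclusive, full DTW is needed
print(within([5] * 6, a, 1.0))  # settled by the lower bound
```

Tighter lower bounds exist and prune more aggressively; the weak bound here was chosen only because its correctness is easy to see.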
Symmetric Submodular Clustering with Actionable Constraint
 DISCML NIPS, 2014
Fast Efficient Clustering Algorithm for Balanced Data
Cited by 1 (1 self)
Cluster analysis is a major technique for statistical analysis, machine learning, pattern recognition, data mining, image analysis and bioinformatics. The k-means algorithm is one of the most important clustering algorithms. However, k-means needs a large amount of computational time to handle large data sets. In this paper, we develop a more efficient clustering algorithm, named Fast Balanced k-means (FBK-means), to overcome this deficiency. This algorithm not only yields clustering results as good as those of the k-means algorithm but also requires less computational time. The algorithm works well in the case of balanced data. Keywords: Clustering; K-means algorithm; Bee algorithm; GA algorithm; FBK-means algorithm
Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance
Cited by 1 (0 self)
Mutual information is a very popular measure for comparing clusterings. Previous work has shown that it is beneficial to make an adjustment for chance to this measure, by subtracting an expected value and normalizing via an upper bound. This yields the constant baseline property that enhances intuitiveness. In this paper, we argue that a further type of statistical adjustment for the mutual information is also beneficial – an adjustment to correct selection bias. This type of adjustment is useful when carrying out many clustering comparisons, to select one or more preferred clusterings. It reduces the tendency for the mutual information to choose clustering solutions i) with more clusters, or ii) induced on fewer data points, when compared to a reference one. We term our new adjusted measure the standardized mutual information. It requires computation of the variance of mutual information under a hypergeometric model of randomness, which is technically challenging. We derive an analytical formula for this variance and analyze its complexity. We then experimentally assess how our new measure can address selection bias and also increase interpretability. We recommend using the standardized mutual information when making multiple clustering comparisons in situations where the number of records is small compared to the number of clusters considered.
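The two adjustments described in this abstract can be written side by side. The notation below is assumed rather than taken from the abstract: $U$ and $V$ are the two clusterings, $H(\cdot)$ is entropy, and $\max\{H(U), H(V)\}$ is one common choice of upper bound. Expectation and variance are taken under the hypergeometric model of randomness. The adjustment for chance divides by the gap between the upper bound and the expected value, while the standardized measure, in a form consistent with the description above, divides by the standard deviation instead.

```latex
\mathrm{AMI}(U,V) =
  \frac{\mathrm{MI}(U,V) - \mathbb{E}[\mathrm{MI}(U,V)]}
       {\max\{H(U),\, H(V)\} - \mathbb{E}[\mathrm{MI}(U,V)]},
\qquad
\mathrm{SMI}(U,V) =
  \frac{\mathrm{MI}(U,V) - \mathbb{E}[\mathrm{MI}(U,V)]}
       {\sqrt{\mathrm{Var}[\mathrm{MI}(U,V)]}}
```

Read this way, the standardized value counts how many standard deviations the observed mutual information lies above what random clusterings with the same cluster sizes would produce.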
On the Equivalence of PLSI and Projected Clustering [Position Paper]
The problem of projected clustering was first proposed in the ACM SIGMOD Conference in 1999, and the Probabilistic Latent Semantic Indexing (PLSI) technique was independently proposed in the ACM SIGIR Conference in the same year. Since then, more than two thousand papers have been written on these problems by the database, data mining and information retrieval communities, along completely independent lines of work. In this paper, we show that these two problems are essentially equivalent, under a probabilistic interpretation of the projected clustering problem. We will show that the EM-algorithm, when applied to the probabilistic version of the projected clustering problem, can be almost identically interpreted as the PLSI technique. The implications of this equivalence are significant, in that they imply the cross-usability of many of the techniques which have been developed for these problems over the last decade. We hope that our observations about the equivalence of these problems will stimulate further research which can significantly improve the currently available solutions for either of these problems.
Large-scale Analysis of Event Data
With the availability of numerous sources and the development of sophisticated text analysis and information retrieval techniques, more and more spatiotemporal data are extracted from texts such as news documents or social network data. Temporal and geographic information obtained this way often forms some kind of event, describing when and where something happened. An important task in the context of business intelligence and document exploration applications is the correlation of events in terms of their temporal, geographic or even semantic properties. In this paper we discuss the tasks related to event data analysis, ranging from the extraction of events to determining events that are similar in terms of space and time by using skyline processing and clustering. We present a framework implemented in Apache Spark that provides operators supporting these tasks and thus allows analysis pipelines to be built.
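To make the notion of skyline processing concrete, here is a minimal sketch of a generic skyline (Pareto-optimal set) operator in plain Python. It is not one of the framework's Spark operators, and the data layout is hypothetical: each candidate event is represented by an assumed pair (spatial distance in km, temporal distance in hours) relative to some reference event, with smaller values meaning "more similar".

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly better in at least one."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))

def skyline(points):
    """Keep the points that no other point dominates (naive quadratic scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (km, hours) distances of five candidate events from a reference event.
candidates = [(1.0, 5.0), (2.0, 1.0), (3.0, 3.0), (0.5, 8.0), (2.5, 0.5)]
result = skyline(candidates)
# (3.0, 3.0) is dropped because (2.0, 1.0) is closer in both space and time.
```

The skyline keeps every event that is a best trade-off between the two distances, so no weighting between space and time has to be chosen in advance.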
Revealing cell assemblies at multiple levels of granularity
Background: Current neuronal monitoring techniques, such as calcium imaging and multi-electrode arrays, enable recordings of spiking activity from hundreds of neurons simultaneously. Of primary importance in systems neuroscience is the identification of cell assemblies: groups of neurons that cooperate in some form within the recorded population. New Method: We introduce a simple, integrated framework for the detection of cell-assemblies from spiking data without a priori assumptions about the size or number of groups present. We define a biophysically-inspired measure to extract a directed functional connectivity matrix between both excitatory and inhibitory neurons based on their spiking history. The resulting network representation is analyzed using the Markov Stability framework, a graph-theoretical method for community detection across scales, to reveal groups of neurons that are significantly related in the recorded time-series at different levels of granularity. Results and comparison with existing methods: Using synthetic spike-trains, including simulated data from leaky-integrate-and-fire networks, our method is able to identify important patterns in the data such as hierarchical structure that are missed by other standard methods. We further apply the method to experimental data from retinal ganglion cells of mouse and salamander, in which we identify cell-groups that correspond to known functional types, and to hippocampal recordings from rats exploring a linear track, where we detect place cells with high fidelity. Conclusions: We present a versatile method to detect neural assemblies in spiking data applicable across a spectrum of relevant scales that contributes to understanding spatio-temporal information gathered from systems neuroscience experiments.