Detecting significant multidimensional spatial clusters
- Advances in Neural Information Processing Systems 17, 2005
Cited by 14 (7 self)
Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count c_i and a baseline b_i. Our goal is to find spatial regions (d-dimensional rectangles) where the c_i are significantly higher than expected given b_i. We focus on two applications: detection of clusters of disease cases from epidemiological data (emergency department visits, over-the-counter drug sales), and discovery of regions of increased brain activity corresponding to given cognitive tasks (from fMRI data). Each of these problems can be solved using a spatial scan statistic (Kulldorff, 1997), where we compute the maximum of a likelihood ratio statistic over all spatial regions, and find the significance of this region by randomization. However, computing the scan statistic for all spatial regions is generally computationally infeasible, so we introduce a novel fast spatial scan algorithm, generalizing the 2D scan algorithm of Neill and Moore (2004) to arbitrary dimensions. Our new multidimensional multiresolution algorithm allows us to find spatial clusters up to 1400x faster than the naive spatial scan, without any loss of accuracy.
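As a point of reference for the naive spatial scan that the abstract compares against, the following is a minimal 2D sketch, not the paper's fast algorithm. It assumes the Poisson form of Kulldorff's log-likelihood ratio and a plain list-of-lists grid; the function names (`kulldorff_score`, `prefix_sums`, `rect_sum`, `naive_scan`) are hypothetical, and the randomization step for significance is omitted.

```python
import math


def kulldorff_score(c, b, c_tot, b_tot):
    """Log-likelihood ratio of a region with count c and baseline b.

    Assumed Poisson form; returns 0 unless the rate inside the region
    is strictly higher than the rate outside it.
    """
    c_out, b_out = c_tot - c, b_tot - b
    if b <= 0 or b_out <= 0 or c * b_out <= c_out * b:
        return 0.0
    score = c * math.log(c / b) - c_tot * math.log(c_tot / b_tot)
    if c_out > 0:
        score += c_out * math.log(c_out / b_out)
    return score


def prefix_sums(grid):
    """2D prefix sums: P[i][j] is the sum of the first i rows and j columns."""
    rows, cols = len(grid), len(grid[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            P[i + 1][j + 1] = grid[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P


def rect_sum(P, r0, r1, c0, c1):
    """Sum over rows r0..r1-1 and columns c0..c1-1 (half-open ranges)."""
    return P[r1][c1] - P[r0][c1] - P[r1][c0] + P[r0][c0]


def naive_scan(counts, baselines):
    """Score every axis-aligned rectangle and return the best one."""
    rows, cols = len(counts), len(counts[0])
    PC, PB = prefix_sums(counts), prefix_sums(baselines)
    c_tot, b_tot = PC[rows][cols], PB[rows][cols]
    best_score, best_rect = 0.0, None
    for r0 in range(rows):
        for r1 in range(r0 + 1, rows + 1):
            for c0 in range(cols):
                for c1 in range(c0 + 1, cols + 1):
                    s = kulldorff_score(rect_sum(PC, r0, r1, c0, c1),
                                        rect_sum(PB, r0, r1, c0, c1),
                                        c_tot, b_tot)
                    if s > best_score:
                        best_score, best_rect = s, (r0, r1, c0, c1)
    return best_score, best_rect
```

Even with constant-time rectangle sums, the four nested loops visit a number of rectangles that grows with the fourth power of the grid side, which is the cost a faster scan would aim to avoid.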
An information-theoretic approach to detecting changes in multi-dimensional data streams
- In Proc. Symp. on the Interface of Statistics, Computing Science, and Applications, 2006
Cited by 13 (1 self)
An important problem in processing large data streams is detecting changes in the underlying distribution that generates the data. The challenge in designing change detection schemes is making them general, scalable, and statistically sound. In this paper, we take a general, information-theoretic approach to the change detection problem, which works for multidimensional as well as categorical data. We use relative entropy, also called the Kullback-Leibler distance, to measure the difference between two given distributions. The KL-distance is known to be related to the optimal error in determining whether the two distributions are the same and draws on fundamental results in hypothesis testing. The KL-distance also generalizes traditional distance measures in statistics, and has invariance properties that make it ideally suited for comparing distributions. Our scheme is general; it is nonparametric and requires no assumptions on the underlying distributions. It employs a statistical inference procedure based on the theory of bootstrapping, which allows us to determine whether our measurements are statistically significant. The scheme is also quite flexible from a practical perspective; it can be implemented using any spatial partitioning scheme that scales well with dimensionality. In addition to providing change detections, our method generalizes Kulldorff's spatial scan statistic, allowing us to quantitatively identify specific regions in space where large changes have occurred. We provide a detailed experimental study that demonstrates the generality and efficiency of our approach with different kinds of multidimensional datasets, both synthetic and real.
1 Introduction
We are collecting and storing data in unprecedented quantities and varieties: streams, images, audio, text, metadata descriptions, and even simple numbers. Over time, these data streams change as the underlying processes that generate them change. Some changes are spurious and pertain to glitches in the data. Some are genuine, caused by changes in the underlying distributions. Some changes are gradual and some are more precipitous. We would like to detect changes in a variety of settings:
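The scheme summarized in this entry's abstract (relative entropy between two windows, with bootstrapping to judge significance) can be illustrated with a small one-dimensional sketch. Everything specific here is an assumption made for illustration rather than the paper's procedure: equal-width bins, add-half smoothing of the histograms, resampling from the pooled windows, and the hypothetical names `histogram`, `kl_distance`, and `bootstrap_p_value`.

```python
import math
import random


def histogram(values, lo, hi, n_bins):
    """Counts of values in n_bins equal-width bins covering [lo, hi]."""
    width = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    counts = [0] * n_bins
    for v in values:
        k = int((v - lo) / width * n_bins)
        counts[min(max(k, 0), n_bins - 1)] += 1
    return counts


def kl_distance(p_counts, q_counts):
    """Relative entropy D(P || Q) of two histograms with add-half smoothing."""
    n_bins = len(p_counts)
    p_n = sum(p_counts) + 0.5 * n_bins
    q_n = sum(q_counts) + 0.5 * n_bins
    total = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p, q = (pc + 0.5) / p_n, (qc + 0.5) / q_n
        total += p * math.log(p / q)
    return total


def bootstrap_p_value(w1, w2, n_bins=10, n_boot=200, seed=0):
    """Observed KL distance between two windows and a bootstrap p-value.

    Null hypothesis (assumed): both windows come from one distribution,
    approximated by resampling with replacement from the pooled data.
    """
    rng = random.Random(seed)
    pooled = list(w1) + list(w2)
    lo, hi = min(pooled), max(pooled)

    def stat(a, b):
        return kl_distance(histogram(a, lo, hi, n_bins),
                           histogram(b, lo, hi, n_bins))

    observed = stat(w1, w2)
    exceed = 0
    for _ in range(n_boot):
        a = rng.choices(pooled, k=len(w1))
        b = rng.choices(pooled, k=len(w2))
        if stat(a, b) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_boot)
```

A small p-value suggests the second window was drawn from a different distribution than the first; replacing the equal-width bins with a spatial partition is what would carry the same idea to higher dimensions.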
The Hunting of the Bump: On Maximizing Statistical Discrepancy
- In SODA, 2006
Cited by 8 (4 self)
Anomaly detection has important applications in biosurveillance and environmental monitoring. When comparing measured data to data drawn from a baseline distribution, merely finding clusters in the measured data may not reveal true anomalies, because these clusters may simply be clusters of the baseline distribution. Hence, a discrepancy function is often used to examine how different measured data is from baseline data within a region. An anomalous region is thus defined to be one with high discrepancy. In this paper, we present algorithms for maximizing statistical discrepancy functions over the space of axis-parallel rectangles. We give provable approximation guarantees, both additive and relative, and our methods apply to any convex discrepancy function. Our algorithms work by connecting statistical discrepancy to combinatorial discrepancy; roughly speaking, we show that in order to maximize a convex discrepancy function over a class of shapes, one needs only maximize a linear discrepancy function over the same set of shapes. We derive general discrepancy functions for data generated from a one-parameter exponential family. This generalizes the widely-used Kulldorff scan statistic for data from a Poisson distribution. We present an algorithm running in O((1/ε) n^2 log^2 n) that computes the maximum discrepancy rectangle to within additive error ε, for the Kulldorff scan statistic. Similar results hold for relative error and for discrepancy functions for data coming from Gaussian, Bernoulli, and gamma distributions. Prior to our work, the best known algorithms were exact and ran in time O(n^4).
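For orientation, the Kulldorff discrepancy and the convexity fact behind the reduction to linear functions can be written out as follows. This is a sketch under assumed notation that the listing itself does not define: m and b denote the fractions of measured and baseline data falling inside a region, and d is any differentiable convex discrepancy function.

```latex
% Kulldorff (Poisson) discrepancy of a region, with
% m = fraction of measured data inside the region,
% b = fraction of baseline data inside the region (assumed notation).
d_K(m, b) = m \ln\frac{m}{b} + (1 - m) \ln\frac{1 - m}{1 - b}, \qquad m > b

% First-order characterization of convexity: every tangent plane lies
% below d, so d is the upper envelope of functions linear in p = (m, b).
d(p) = \max_{q} \bigl[\, d(q) + \nabla d(q)^{\top} (p - q) \,\bigr]
```

Since each bracketed term is linear in p, maximizing d over a family of regions amounts to maximizing linear functions over that same family and keeping the best result, which matches the reduction the abstract describes in words.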
Spatial scan statistics: Approximations and performance study
- In Proceedings 12th ACM SIGKDD Knowledge Discovery & Data Mining, 2006
Cited by 7 (2 self)
Spatial scan statistics are used to determine hotspots in spatial data, and are widely used in epidemiology and biosurveillance. In recent years, there has been much effort invested in designing efficient algorithms for finding such “high discrepancy” regions, with methods ranging from fast heuristics for special cases, to general grid-based methods, and to efficient approximation algorithms with provable guarantees on performance and quality. In this paper, we make a number of contributions to the computational study of spatial scan statistics. First, we describe a simple exact algorithm for finding the largest discrepancy region in a domain. Second, we propose a new approximation algorithm for a large class of discrepancy functions (including the Kulldorff scan statistic) that improves the approximation versus runtime trade-off of prior methods. Third, we extend our simple exact and our approximation algorithms to data sets which lie naturally on a grid or are accumulated onto a grid. Fourth, we conduct a detailed experimental comparison of these methods with a number of known methods, demonstrating that our approximation algorithm has far superior performance in practice to prior methods, and exhibits a good performance-accuracy trade-off. All extant methods (including those in this paper) are suitable for data sets that are modestly sized; if data sets are of the order of millions of data points, none of these methods scale well. For such massive data settings, it is natural to examine whether small-space streaming algorithms might yield accurate answers. Here, we provide some negative results, showing that any streaming algorithms that even provide approximately optimal answers to the discrepancy maximization problem must use space linear in the input.
Algorithms for ε-approximation of terrains
- 2008
Cited by 6 (6 self)
Consider a point set D with a measure function µ: D → R. Let A be the set of subsets of D induced by containment in a shape from some geometric family (e.g. axis-parallel rectangles, half planes, balls, k-oriented polygons). We say a range space (D, A) has an ε-approximation P if
Statistical Change Detection for Multi-Dimensional Data
Cited by 5 (0 self)
This paper deals with detecting change of distribution in multi-dimensional data sets. For a given baseline data set and a set of newly observed data points, we define a statistical test called the density test for deciding if the observed data points are sampled from the underlying distribution that produced the baseline data set. We define a test statistic that is strictly distribution-free under the null hypothesis. Our experimental results show that the density test has substantially more power than the two existing methods for multi-dimensional change detection.
Fast Subset Scan for Spatial Pattern Detection
- J. Royal Statistical Society B
Cited by 4 (1 self)
Summary. We propose a new ‘fast subset scan’ approach for accurate and computationally efficient event detection in massive data sets. We treat event detection as a search over subsets of data records, finding the subset which maximizes some score function. We prove that many commonly used functions (e.g. Kulldorff’s spatial scan statistic and extensions) satisfy the ‘linear time subset scanning’ property, enabling exact and efficient optimization over subsets. In the spatial setting, we demonstrate that proximity-constrained subset scans substantially improve the timeliness and accuracy of event detection, detecting emerging outbreaks of disease 2 days faster than existing methods. Keywords: Algorithms; Disease surveillance; Event detection; Scan statistics; Spatial scan
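The 'linear time subset scanning' property can be illustrated with a short sketch. It assumes that records are ranked by the ratio of count to baseline and that only the top-k prefixes of that ranking need scoring; both the ranking and the Poisson form of Kulldorff's statistic are assumptions here, not details given in the summary. The names `ltss_scan` and `brute_force_scan` are hypothetical, and baselines are assumed strictly positive.

```python
import math
from itertools import combinations


def kulldorff_score(c, b, c_tot, b_tot):
    """Assumed Poisson log-likelihood ratio; 0 unless the inside rate is higher."""
    c_out, b_out = c_tot - c, b_tot - b
    if b <= 0 or b_out <= 0 or c * b_out <= c_out * b:
        return 0.0
    score = c * math.log(c / b) - c_tot * math.log(c_tot / b_tot)
    if c_out > 0:
        score += c_out * math.log(c_out / b_out)
    return score


def ltss_scan(counts, baselines):
    """Score only the N prefixes of the records sorted by count/baseline."""
    c_tot, b_tot = sum(counts), sum(baselines)
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i], reverse=True)
    best_score, best_subset = 0.0, []
    c = b = 0
    for k, i in enumerate(order, start=1):
        c += counts[i]
        b += baselines[i]
        s = kulldorff_score(c, b, c_tot, b_tot)
        if s > best_score:
            best_score, best_subset = s, sorted(order[:k])
    return best_score, best_subset


def brute_force_scan(counts, baselines):
    """Score all 2^N - 1 non-empty subsets, for checking small inputs only."""
    c_tot, b_tot = sum(counts), sum(baselines)
    best = 0.0
    indices = range(len(counts))
    for k in range(1, len(counts) + 1):
        for subset in combinations(indices, k):
            c = sum(counts[i] for i in subset)
            b = sum(baselines[i] for i in subset)
            best = max(best, kulldorff_score(c, b, c_tot, b_tot))
    return best
```

On small inputs the two functions can be compared directly; the prefix scan evaluates N subsets after one sort, whereas the brute-force search evaluates exponentially many.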
Spatial Scan Statistics for Graph Clustering
Cited by 2 (1 self)
In this paper, we present a measure associated with detection and inference of statistically anomalous clusters of a graph based on the likelihood test of observed and expected edges in a subgraph. This measure is adapted from spatial scan statistics for point sets and provides quantitative assessment for clusters. We discuss some important properties of this statistic and its relation to modularity and Bregman divergences. We apply a simple clustering algorithm to find clusters with large values of this measure in a variety of real-world data sets, and we illustrate its ability to identify statistically significant clusters of selected granularity. 1 Introduction. Numerous techniques have been proposed for identifying clusters in large networks, but it has proven difficult to
Fast Algorithms for Burst Detection
- 2006
Cited by 2 (0 self)
To our lovely baby Marvin. Dedicated to my family who always support me. Acknowledgments: It has been a challenging process for me to finish this dissertation while working full time. Without the help of many individuals, this dissertation would never come into shape. I warmly thank my advisor Professor Dennis Shasha. He always gives me kind encouragement and inspiring advice. Without his encouragement and help, I would not have been able to finish this work. Many thanks to Professor Richard Cole, Professor Zvi M. Kedem, Professor Bhubaneswar Mishra for their inspiring comments and suggestions on this work. I want to thank many friends in NYU: Xiaojian Zhao, Zhihua Wang, Congchun He and Yunyue Zhu. They have lent lots of help in sharing their resources in preparing this work.

