## LOF: Identifying Density-Based Local Outliers (2000)

Venue: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data

Citations: 516 (13 self)

### Citations

2051 | Computational Geometry: An Introduction
- Preparata, Shamos
- 1990
Citation Context ...ions used are univariate. There are some tests that are multivariate (e.g. multivariate normal outliers). But for many KDD applications, the underlying distribution is unknown. Fitting the data with standard distributions is costly, and may not produce satisfactory results. The second category of outlier studies in statistics is depth-based. Each data object is represented as a point in a k-d space, and is assigned a depth. With respect to outlier detection, outliers are more likely to be data objects with smaller depths. There are many definitions of depth that have been proposed (e.g. [20], [16]). In theory, depth-based approaches could work for large values of k. However, in practice, while there exist efficient algorithms for k = 2 or 3 ([16], [18], [12]), depth-based approaches become inefficient for large datasets for k ≥ 4. This is because depth-based approaches rely on the computation of k-d convex hulls which has a lower bound complexity of Ω(n^(k/2)) for n objects. Recently, Knorr and Ng proposed the notion of distance-based outliers [13], [14]. Their notion generalizes many notions from the distribution-based approaches, and enjoys better computational complexity than the depth... |

1800 | Exploratory Data Analysis
- Tukey
- 1977
Citation Context ...tributions used are univariate. There are some tests that are multivariate (e.g. multivariate normal outliers). But for many KDD applications, the underlying distribution is unknown. Fitting the data with standard distributions is costly, and may not produce satisfactory results. The second category of outlier studies in statistics is depth-based. Each data object is represented as a point in a k-d space, and is assigned a depth. With respect to outlier detection, outliers are more likely to be data objects with smaller depths. There are many definitions of depth that have been proposed (e.g. [20], [16]). In theory, depth-based approaches could work for large values of k. However, in practice, while there exist efficient algorithms for k = 2 or 3 ([16], [18], [12]), depth-based approaches become inefficient for large datasets for k ≥ 4. This is because depth-based approaches rely on the computation of k-d convex hulls which has a lower bound complexity of Ω(n^(k/2)) for n objects. Recently, Knorr and Ng proposed the notion of distance-based outliers [13], [14]. Their notion generalizes many notions from the distribution-based approaches, and enjoys better computational complexity than the... |

1785 | A density-based algorithm for discovering clusters in large spatial databases with noise
- Ester, Kriegel, et al.
- 1996
Citation Context ...tlier is still distance-based. Given the importance of the area, fraud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work discussed so far are more exploratory in nature. Outlier detection may indeed lead to the construction of fraud models. Finally, most clustering algorithms, especially those developed in the context of KDD (e.g. CLARANS [15], DBSCAN [7], BIRCH [23], STING [22], WaveCluster [19], DenClue [11], CLIQUE [3]), are to some extent capable of handling exceptions. However, since the main objective of a clustering algorithm is to find clusters, they are developed to optimize clustering, and not to optimize outlier detection. The exceptions (called “noise” in the context of clustering) are typically just tolerated or ignored when producing the clustering result. Even if the outliers are not ignored, the notions of outliers are essentially binary, and there is no quantification as to how outlying an object is. Our notion of local outli... |

724 | Automatic subspace clustering of high dimensional data for data mining applications
- Agrawal, Gehrke, et al.
- 1998
Citation Context ...ud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work discussed so far are more exploratory in nature. Outlier detection may indeed lead to the construction of fraud models. Finally, most clustering algorithms, especially those developed in the context of KDD (e.g. CLARANS [15], DBSCAN [7], BIRCH [23], STING [22], WaveCluster [19], DenClue [11], CLIQUE [3]), are to some extent capable of handling exceptions. However, since the main objective of a clustering algorithm is to find clusters, they are developed to optimize clustering, and not to optimize outlier detection. The exceptions (called “noise” in the context of clustering) are typically just tolerated or ignored when producing the clustering result. Even if the outliers are not ignored, the notions of outliers are essentially binary, and there is no quantification as to how outlying an object is. Our notion of local outliers shares a few fundamental concepts with density-based clustering a... |

709 | Efficient and effective clustering methods for spatial data mining
- Ng, Han
- 1994
Citation Context ...tion of an outlier is still distance-based. Given the importance of the area, fraud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work discussed so far are more exploratory in nature. Outlier detection may indeed lead to the construction of fraud models. Finally, most clustering algorithms, especially those developed in the context of KDD (e.g. CLARANS [15], DBSCAN [7], BIRCH [23], STING [22], WaveCluster [19], DenClue [11], CLIQUE [3]), are to some extent capable of handling exceptions. However, since the main objective of a clustering algorithm is to find clusters, they are developed to optimize clustering, and not to optimize outlier detection. The exceptions (called “noise” in the context of clustering) are typically just tolerated or ignored when producing the clustering result. Even if the outliers are not ignored, the notions of outliers are essentially binary, and there is no quantification as to how outlying an object is. Our notion of... |

623 | A quantitative analysis and performance study for similarity-search methods in high dimensional spaces
- Weber, Schek, et al.
- 1998
Citation Context ...is O(n*time for a k-nn query). For the k-nn queries, we have a choice among different methods. For low-dimensional data, we can use a grid based approach which can answer k-nn queries in constant time, leading to a complexity of O(n) for the materialization step. For medium to medium high-dimensional data, we can use an index, which provides an average complexity of O(log n) for k-nn queries, leading to a complexity of O(n log n) for the materialization. For extremely high-dimensional data, we need to use a sequential scan or some variant of it, e.g. the VA-file ([21]), with a complexity of O(n), leading to a complexity of O(n^2) for the materialization step. In our experiments, we used a variant of the X-tree ([4]), leading to the complexity of O(n log n). Figure 10 shows performance experiments for different dimensional datasets and MinPtsUB=50. The times shown do include the time to build the index. Obviously, the index works very well for the 2-dimensional and 5-dimensional datasets, leading to a near linear performance, but degenerates for the 10-dimensional and 20-dimensional datasets. It is a well known effect of index structures, that their effectivity ... |
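
The sequential-scan case described in this context can be sketched roughly as follows. This is a minimal brute-force illustration under stated assumptions, not the X-tree variant used in the paper's experiments: the function names `knn_sequential_scan` and `materialize_neighbourhoods` are hypothetical, and Euclidean distance is assumed.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def knn_sequential_scan(data: List[Point], i: int, k: int) -> List[Tuple[float, int]]:
    """Return the k nearest neighbours of data[i] as (distance, index) pairs.

    Hypothetical helper: every query scans all n objects once, so a single
    query needs O(n) distance computations (Euclidean distance is assumed).
    """
    candidates = (
        (math.dist(data[i], q), j) for j, q in enumerate(data) if j != i
    )
    # nsmallest returns the k smallest pairs in ascending order.
    return heapq.nsmallest(k, candidates)


def materialize_neighbourhoods(data: List[Point], k: int) -> List[List[Tuple[float, int]]]:
    """Run one k-nn query per object: n scans of n objects, i.e. O(n^2) distances."""
    return [knn_sequential_scan(data, i, k) for i in range(len(data))]
```

Swapping the scan for a grid or an index structure would change only `knn_sequential_scan`; the outer loop of n queries stays the same, which is why the per-query cost determines the overall complexity.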

576 | BIRCH: an efficient data clustering method for very large databases.
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context ...ill distance-based. Given the importance of the area, fraud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work discussed so far are more exploratory in nature. Outlier detection may indeed lead to the construction of fraud models. Finally, most clustering algorithms, especially those developed in the context of KDD (e.g. CLARANS [15], DBSCAN [7], BIRCH [23], STING [22], WaveCluster [19], DenClue [11], CLIQUE [3]), are to some extent capable of handling exceptions. However, since the main objective of a clustering algorithm is to find clusters, they are developed to optimize clustering, and not to optimize outlier detection. The exceptions (called “noise” in the context of clustering) are typically just tolerated or ignored when producing the clustering result. Even if the outliers are not ignored, the notions of outliers are essentially binary, and there is no quantification as to how outlying an object is. Our notion of local outliers shares a ... |

526 | OPTICS: ordering points to identify the clustering structure.
- Ankerst, Breunig
- 1999
Citation Context ...ments to validate our method. In the third example, we identify meaningful outliers in a database of German soccer players, for which we happen to have a “domain expert” handy, who confirmed the meaningfulness of the outliers found. The last subsection contains performance experiments showing the practicability of our approach even for large, high-dimensional datasets. Additionally, we conducted experiments with a 64-dimensional dataset, to demonstrate that our definitions are reasonable in very high dimensional spaces. The feature vectors used are color histograms extracted from TV snapshots [2]. We identified multiple clusters, e.g. a cluster of pictures from a tennis match, and reasonable local outliers with LOF values of up to 7. 7.1 A Synthetic Example The left side of figure 9 shows a 2-dimensional dataset containing one low density Gaussian cluster of 200 objects and three large clusters of 500 objects each. Among these three, one is a dense Gaussian cluster and the other two are uniform clusters of different densities. Furthermore, it contains a couple of outliers. On the right side of figure 9 we plot the LOF of all the objects for MinPts = 40 as a third dimension. We see th... |

431 | Outliers in Statistical Data.
- Barnett, Lewis
- 1994
Citation Context ...nd discuss ways to choose MinPts values for LOF computation. In section 7 we perform an extensive experimental evaluation. 2. RELATED WORK Most of the previous studies on outlier detection were conducted in the field of statistics. These studies can be broadly classified into two categories. The first category is distribution-based, where a standard distribution (e.g. Normal, Poisson, etc.) is used to fit the data best. Outliers are defined based on the probability distribution. Over one hundred tests of this category, called discordancy tests, have been developed for different scenarios (see [5]). A key drawback of this category of tests is that most of the distributions used are univariate. There are some tests that are multivariate (e.g. multivariate normal outliers). But for many KDD applications, the underlying distribution is unknown. Fitting the data with standard distributions is costly, and may not produce satisfactory results. The second category of outlier studies in statistics is depth-based. Each data object is represented as a point in a k-d space, and is assigned a depth. With respect to outlier detection, outliers are more likely to be data objects with smaller depths.... |

359 | Algorithms for mining distance-based outliers in large datasets.
- Knorr, Ng
- 1998
Citation Context ...ost studies in KDD focus on finding patterns applicable to a considerable portion of objects in a dataset. However, for applications such as detecting criminal activities of various kinds (e.g. in electronic commerce), rare events, deviations from the majority, or exceptional cases may be more interesting and useful than the common cases. Finding such exceptions and outliers, however, has not yet received as much attention in the KDD community as some other topics have, e.g. association rules. Recently, a few studies have been conducted on outlier detection for large datasets (e.g. [18], [1], [13], [14]). While a more detailed discussion on these studies will be given in section 2, it suffices to point out here that most of these studies consider being an outlier as a binary property. That is, either an object in the dataset is an outlier or not. For many applications, the situation is more complex. And it becomes more meaningful to assign to each object a degree of being an outlier. Also related to outlier detection is an extensive body of work on clustering algorithms. From the viewpoint of a clustering algorithm, outliers are objects not located in clusters of a dataset, usually cal... |

322 | Efficient algorithms for mining outliers from large data sets.
- Ramaswamy, Rastogi, et al.
- 2000
Citation Context ...approaches become inefficient for large datasets for k ≥ 4. This is because depth-based approaches rely on the computation of k-d convex hulls which has a lower bound complexity of Ω(n^(k/2)) for n objects. Recently, Knorr and Ng proposed the notion of distance-based outliers [13], [14]. Their notion generalizes many notions from the distribution-based approaches, and enjoys better computational complexity than the depth-based approaches for larger values of k. Later in section 3, we will discuss in detail how their notion is different from the notion of local outliers proposed in this paper. In [17] the notion of distance based outliers is extended by using the distance to the k-nearest neighbor to rank the outliers. A very efficient algorithm to compute the top n outliers in this ranking is given, but their notion of an outlier is still distance-based. Given the importance of the area, fraud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work dis... |
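
The ranking idea attributed to [17] in this context, scoring each object by the distance to its k-th nearest neighbor and reporting the top n, can be illustrated with a small brute-force sketch. It is not the efficient algorithm the context refers to; the names `kth_nn_distance` and `top_n_outliers` are hypothetical, Euclidean distance is assumed, and k must be smaller than the number of objects.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def kth_nn_distance(data: List[Point], i: int, k: int) -> float:
    """Distance from data[i] to its k-th nearest neighbour (brute force)."""
    dists = (math.dist(data[i], q) for j, q in enumerate(data) if j != i)
    # The k smallest distances come back in ascending order; the last is the k-th.
    return heapq.nsmallest(k, dists)[-1]


def top_n_outliers(data: List[Point], k: int, n: int) -> List[Tuple[float, int]]:
    """Return the n objects with the largest k-th-neighbour distance.

    Each result is a (score, index) pair, strongest outlier first.
    """
    scores = [(kth_nn_distance(data, i, k), i) for i in range(len(data))]
    return heapq.nlargest(n, scores)
```

The score is a single global distance, so an object near a dense cluster and an object inside a sparse cluster can receive similar scores; this is the sense in which the context calls the notion "still distance-based".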

293 | Identification of Outliers.
- Hawkins
- 1980
Citation Context ...oducing the clustering result. Even if the outliers are not ignored, the notions of outliers are essentially binary, and there is no quantification as to how outlying an object is. Our notion of local outliers shares a few fundamental concepts with density-based clustering approaches. However, our outlier detection method does not require any explicit or implicit notion of clusters. 3. PROBLEMS OF EXISTING (NON-LOCAL) APPROACHES As we have seen in section 2, most of the existing work in outlier detection lies in the field of statistics. Intuitively, outliers can be defined as given by Hawkins [10]. Definition 1: (Hawkins-Outlier) An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. This notion is formalized by Knorr and Ng [13] in the following definition of outliers. Throughout this paper, we use o, p, q to denote objects in a dataset. We use the notation d(p, q) to denote the distance between objects p and q. For a set of objects, we use C (sometimes with the intuition that C forms a cluster). To simplify our notation, we use d(p, C) to denote the minimum distance between p and object q in C,... |
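
The point-to-set distance described in words at the end of this context can be written compactly. The formula below only restates that verbal definition, assuming C is a finite, non-empty set of objects so that the minimum exists:

```latex
d(p, C) = \min_{q \in C} d(p, q)
```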

290 | STING: A Statistical Information Grid Approach to Spatial Data Mining
- Wang, Yang, et al.
- 1997
Citation Context ...-based. Given the importance of the area, fraud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work discussed so far are more exploratory in nature. Outlier detection may indeed lead to the construction of fraud models. Finally, most clustering algorithms, especially those developed in the context of KDD (e.g. CLARANS [15], DBSCAN [7], BIRCH [23], STING [22], WaveCluster [19], DenClue [11], CLIQUE [3]), are to some extent capable of handling exceptions. However, since the main objective of a clustering algorithm is to find clusters, they are developed to optimize clustering, and not to optimize outlier detection. The exceptions (called “noise” in the context of clustering) are typically just tolerated or ignored when producing the clustering result. Even if the outliers are not ignored, the notions of outliers are essentially binary, and there is no quantification as to how outlying an object is. Our notion of local outliers shares a few fundamen... |

277 | An efficient approach to clustering in large multimedia databases with noise.
- Hinneburg, Keim
- 1998
Citation Context ...the area, fraud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work discussed so far are more exploratory in nature. Outlier detection may indeed lead to the construction of fraud models. Finally, most clustering algorithms, especially those developed in the context of KDD (e.g. CLARANS [15], DBSCAN [7], BIRCH [23], STING [22], WaveCluster [19], DenClue [11], CLIQUE [3]), are to some extent capable of handling exceptions. However, since the main objective of a clustering algorithm is to find clusters, they are developed to optimize clustering, and not to optimize outlier detection. The exceptions (called “noise” in the context of clustering) are typically just tolerated or ignored when producing the clustering result. Even if the outliers are not ignored, the notions of outliers are essentially binary, and there is no quantification as to how outlying an object is. Our notion of local outliers shares a few fundamental concepts with density-based ... |

221 | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases
- Sheikholeslami, Chatterjee, et al.
- 1998
Citation Context ...importance of the area, fraud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work discussed so far are more exploratory in nature. Outlier detection may indeed lead to the construction of fraud models. Finally, most clustering algorithms, especially those developed in the context of KDD (e.g. CLARANS [15], DBSCAN [7], BIRCH [23], STING [22], WaveCluster [19], DenClue [11], CLIQUE [3]), are to some extent capable of handling exceptions. However, since the main objective of a clustering algorithm is to find clusters, they are developed to optimize clustering, and not to optimize outlier detection. The exceptions (called “noise” in the context of clustering) are typically just tolerated or ignored when producing the clustering result. Even if the outliers are not ignored, the notions of outliers are essentially binary, and there is no quantification as to how outlying an object is. Our notion of local outliers shares a few fundamental concepts with ... |

186 | Knowledge Discovery and Data Mining: Towards a Unifying Framework
- Fayyad, Piatetsky-Shapiro, et al.
- 1996
Citation Context ...approaches. Finally, a careful performance evaluation of our algorithm confirms that our approach of finding local outliers can be practical. Keywords Outlier Detection, Database Mining. 1. INTRODUCTION Larger and larger amounts of data are collected and stored in databases, increasing the need for efficient and effective analysis methods to make use of the information contained implicitly in the data. Knowledge discovery in databases (KDD) has been defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable knowledge from the data [9]. Most studies in KDD focus on finding patterns applicable to a considerable portion of objects in a dataset. However, for applications such as detecting criminal activities of various kinds (e.g. in electronic commerce), rare events, deviations from the majority, or exceptional cases may be more interesting and useful than the common cases. Finding such exceptions and outliers, however, has not yet received as much attention in the KDD community as some other topics have, e.g. association rules. Recently, a few studies have been conducted on outlier detection for large datasets (e.g. [18], [1... |

101 | A Linear Method for Deviation Detection in Large Databases
- Arning, Agrawal, et al.
- 1996
Citation Context ...9]. Most studies in KDD focus on finding patterns applicable to a considerable portion of objects in a dataset. However, for applications such as detecting criminal activities of various kinds (e.g. in electronic commerce), rare events, deviations from the majority, or exceptional cases may be more interesting and useful than the common cases. Finding such exceptions and outliers, however, has not yet received as much attention in the KDD community as some other topics have, e.g. association rules. Recently, a few studies have been conducted on outlier detection for large datasets (e.g. [18], [1], [13], [14]). While a more detailed discussion on these studies will be given in section 2, it suffices to point out here that most of these studies consider being an outlier as a binary property. That is, either an object in the dataset is an outlier or not. For many applications, the situation is more complex. And it becomes more meaningful to assign to each object a degree of being an outlier. Also related to outlier detection is an extensive body of work on clustering algorithms. From the viewpoint of a clustering algorithm, outliers are objects not located in clusters of a dataset, usual... |

80 | Finding Intensional Knowledge of Distance-based Outliers
- Knorr, Ng
- 1999
Citation Context ...udies in KDD focus on finding patterns applicable to a considerable portion of objects in a dataset. However, for applications such as detecting criminal activities of various kinds (e.g. in electronic commerce), rare events, deviations from the majority, or exceptional cases may be more interesting and useful than the common cases. Finding such exceptions and outliers, however, has not yet received as much attention in the KDD community as some other topics have, e.g. association rules. Recently, a few studies have been conducted on outlier detection for large datasets (e.g. [18], [1], [13], [14]). While a more detailed discussion on these studies will be given in section 2, it suffices to point out here that most of these studies consider being an outlier as a binary property. That is, either an object in the dataset is an outlier or not. For many applications, the situation is more complex. And it becomes more meaningful to assign to each object a degree of being an outlier. Also related to outlier detection is an extensive body of work on clustering algorithms. From the viewpoint of a clustering algorithm, outliers are objects not located in clusters of a dataset, usually called no... |

79 | Computing Depth Contours of Bivariate Point Clouds
- Ruts, Rousseeuw
- 1996
Citation Context ...data [9]. Most studies in KDD focus on finding patterns applicable to a considerable portion of objects in a dataset. However, for applications such as detecting criminal activities of various kinds (e.g. in electronic commerce), rare events, deviations from the majority, or exceptional cases may be more interesting and useful than the common cases. Finding such exceptions and outliers, however, has not yet received as much attention in the KDD community as some other topics have, e.g. association rules. Recently, a few studies have been conducted on outlier detection for large datasets (e.g. [18], [1], [13], [14]). While a more detailed discussion on these studies will be given in section 2, it suffices to point out here that most of these studies consider being an outlier as a binary property. That is, either an object in the dataset is an outlier or not. For many applications, the situation is more complex. And it becomes more meaningful to assign to each object a degree of being an outlier. Also related to outlier detection is an extensive body of work on clustering algorithms. From the viewpoint of a clustering algorithm, outliers are objects not located in clusters of a dataset, ... |

78 | Adaptive fraud detection, Data Mining and Knowledge Discovery 3
- Fawcett, Provost
- 1997
Citation Context ...ion is different from the notion of local outliers proposed in this paper. In [17] the notion of distance based outliers is extended by using the distance to the k-nearest neighbor to rank the outliers. A very efficient algorithm to compute the top n outliers in this ranking is given, but their notion of an outlier is still distance-based. Given the importance of the area, fraud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work discussed so far are more exploratory in nature. Outlier detection may indeed lead to the construction of fraud models. Finally, most clustering algorithms, especially those developed in the context of KDD (e.g. CLARANS [15], DBSCAN [7], BIRCH [23], STING [22], WaveCluster [19], DenClue [11], CLIQUE [3]), are to some extent capable of handling exceptions. However, since the main objective of a clustering algorithm is to find clusters, they are developed to optimize clustering, and not to optimize outlier detection. The... |

62 | The X-tree: An index structure for high-dimensional data - Berchtold, Keim, et al. - 1996 |

51 | Fast Computation of 2-Dimensional Depth Contours
- Johnson, Kwok, et al.
- 1998
Citation Context ... is unknown. Fitting the data with standard distributions is costly, and may not produce satisfactory results. The second category of outlier studies in statistics is depth-based. Each data object is represented as a point in a k-d space, and is assigned a depth. With respect to outlier detection, outliers are more likely to be data objects with smaller depths. There are many definitions of depth that have been proposed (e.g. [20], [16]). In theory, depth-based approaches could work for large values of k. However, in practice, while there exist efficient algorithms for k = 2 or 3 ([16], [18], [12]), depth-based approaches become inefficient for large datasets for k ≥ 4. This is because depth-based approaches rely on the computation of k-d convex hulls which has a lower bound complexity of Ω(n^(k/2)) for n objects. Recently, Knorr and Ng proposed the notion of distance-based outliers [13], [14]. Their notion generalizes many notions from the distribution-based approaches, and enjoys better computational complexity than the depth-based approaches for larger values of k. Later in section 3, we will discuss in detail how their notion is different from the notion of local outliers proposed in ... |

25 | A Fast Computer Intrusion Detection Algorithm based on Hypothesis Testing of Command Transition Probabilities
- DuMouchel, Schonlau
- 1998
Citation Context ...s different from the notion of local outliers proposed in this paper. In [17] the notion of distance based outliers is extended by using the distance to the k-nearest neighbor to rank the outliers. A very efficient algorithm to compute the top n outliers in this ranking is given, but their notion of an outlier is still distance-based. Given the importance of the area, fraud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work discussed so far are more exploratory in nature. Outlier detection may indeed lead to the construction of fraud models. Finally, most clustering algorithms, especially those developed in the context of KDD (e.g. CLARANS [15], DBSCAN [7], BIRCH [23], STING [22], WaveCluster [19], DenClue [11], CLIQUE [3]), are to some extent capable of handling exceptions. However, since the main objective of a clustering algorithm is to find clusters, they are developed to optimize clustering, and not to optimize outlier detection. The exce... |

13 | A Linear Method for Deviation Detection - Arning, Agrawal, et al. - 1996 |

2 | A Quantitative Analysis and Performance Study for Similarity-Search Methods - Weber, Schek, Blott - 1998 |

1 | Knowledge and Data Mining: Towards a Unifying Framework - Fayyad, Piatetsky-Shapiro, et al. - 1996 |