Results 1-10 of 134
CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines
 IEEE Transactions on Circuits and Systems for Video Technology
, 2003
A Monte Carlo algorithm for fast projective clustering
 In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data
, 2002
Abstract

Cited by 102 (1 self)
We propose a mathematical formulation for the notion of optimal projective cluster, starting from natural requirements on the density of points in subspaces. This allows us to develop a Monte Carlo algorithm for iteratively computing projective clusters. We prove that the computed clusters are good with high probability. We implemented a modified version of the algorithm, using heuristics to speed up computation. Our extensive experiments show that our method is significantly more accurate than previous approaches. In particular, we use our techniques to build a classifier for detecting rotated human faces in cluttered images.
1. PROJECTIVE CLUSTERING
Clustering is a widely used technique for data mining, indexing, and classification. Many practical methods proposed in the last few years, such as CLARANS [11], BIRCH [15], DBSCAN [5, 6], and
Ontology-based Text Clustering
 In Proceedings of the IJCAI-2001 Workshop “Text Learning: Beyond Supervision”
, 2001
Abstract

Cited by 43 (10 self)
Text clustering typically involves clustering in a high dimensional space, which appears difficult with regard to virtually all practical settings. In addition, given a particular clustering result it is typically very hard to come up with a good explanation of why the text clusters have been constructed the way they are. In this paper, we propose a new approach for applying background knowledge during preprocessing in order to improve clustering results and allow for selection between results. We built various views basing our selection of text features on a heterarchy of concepts. Based on these aggregations, we compute multiple clustering results using K-Means. The results may be distinguished and explained by the corresponding selection of concepts in the ontology. Our results compare favourably with a sophisticated baseline preprocessing strategy.
Range nearest-neighbor query
 IEEE Transactions on Knowledge and Data Engineering (TKDE)
Abstract

Cited by 40 (2 self)
A range nearest-neighbor (RNN) query retrieves the nearest neighbor (NN) for every point in a range. It is a natural generalization of point and continuous nearest-neighbor queries and has many applications. In this paper, we consider the ranges as (hyper)rectangles and propose efficient in-memory processing and secondary memory pruning techniques for RNN queries in both 2D and high-dimensional spaces. These techniques are generalized for kRNN queries, which return the k nearest neighbors for every point in the range. In addition, we devise an auxiliary solution-based index EXO-tree to speed up any type of NN query. EXO-tree is orthogonal to any existing NN processing algorithm and thus can be transparently integrated. An extensive empirical study was conducted to evaluate the CPU and I/O performance of these techniques, and the study showed that they are efficient and robust under various datasets, query ranges, numbers of nearest neighbors, dimensions, and cache sizes.
PerfExplorer: A Performance Data Mining Framework for Large-Scale Parallel Computing
 In Proceedings of SC 2005 conference, ACM
, 2005
Abstract

Cited by 35 (9 self)
Parallel applications running on high-end computer systems manifest a complexity of performance phenomena. Tools to observe parallel performance attempt to capture these phenomena in measurement datasets rich with information relating multiple performance metrics to execution dynamics and parameters specific to the application-system experiment. However, the potential size of datasets and the need to assimilate results from multiple experiments make it a daunting challenge not only to process the information, but also to discover and understand performance insights. In this paper, we present PerfExplorer, a framework for parallel performance data mining and knowledge discovery. The framework architecture enables the development and integration of data mining operations that will be applied to large-scale parallel performance profiles. PerfExplorer operates as a client-server system and is built on a robust parallel performance database (PerfDMF) to access the parallel profiles and save its analysis results. Examples are given demonstrating these techniques for performance analysis of ASCI applications.
Towards Systematic Design of Distance Functions for Data Mining Applications
, 2003
Abstract

Cited by 33 (6 self)
Distance function computation is a key subtask in many data mining algorithms and applications. The most effective form of the distance function can only be expressed in the context of a particular data domain. It is also often a challenging and nontrivial task to find the most effective form of the distance function. For example, in the text domain, distance function design has been considered such an important and complex issue that it has been the focus of intensive research over three decades. The final design of distance functions in this domain has been reached only by detailed empirical testing and consensus over the quality of results provided by the different variations. With the increasing ability to collect data in an automated way, the number of new kinds of data continues to increase rapidly. This makes it increasingly difficult to undertake such efforts for each and every new data type. The most important aspect of distance function design is that since a human is the end user for any application, the design must satisfy the user requirements with regard to effectiveness. This creates the need for a systematic framework to design distance functions which are sensitive to the particular characteristics of the data domain. In this paper, we discuss such a framework. The goal is to create distance functions in an automated way while minimizing the work required from the user. We will show that this framework creates distance functions which are significantly more effective than popularly used functions such as the Euclidean metric.
The curse of dimensionality in data mining and time series prediction
 Computational Intelligence and Bioinspired Systems, Lecture Notes in Computer Science 3512
, 2005
Abstract

Cited by 32 (0 self)
Modern data analysis tools have to work on high-dimensional data, whose components are not independently distributed. High-dimensional spaces show surprising, counterintuitive geometrical properties that have a large influence on the performance of data analysis tools. Among these properties, the concentration of the norm phenomenon results in the fact that Euclidean norms and Gaussian kernels, both commonly used in models, become inappropriate in high-dimensional spaces. This paper presents alternative distance measures and kernels, together with geometrical methods to decrease the dimension of the space. The methodology is applied to a typical time series prediction example.
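The concentration of the norm mentioned in this abstract can be illustrated numerically. What follows is a minimal sketch, not the paper's experiment: it assumes points drawn i.i.d. uniformly from the unit hypercube (the abstract itself stresses that real data components are not independent), and the function name `relative_norm_spread` and its parameters are hypothetical.

```python
import math
import random
import statistics


def relative_norm_spread(dim, n_points=200, seed=0):
    """Return std/mean of the Euclidean norms of random points in [0, 1]^dim.

    Assumption: coordinates are i.i.d. uniform, chosen only for illustration.
    """
    rng = random.Random(seed)
    norms = [
        math.sqrt(sum(rng.random() ** 2 for _ in range(dim)))
        for _ in range(n_points)
    ]
    return statistics.pstdev(norms) / statistics.mean(norms)


if __name__ == "__main__":
    # Print the relative spread for a few dimensions.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_norm_spread(dim), 4))
```

Under these assumptions the ratio shrinks as the dimension grows, so the norms of different points become harder to tell apart, which is one reading of why plain Euclidean norms lose discriminative power.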
Angle-Based Outlier Detection in High-dimensional Data
Abstract

Cited by 31 (8 self)
Detecting outliers in a large set of data objects is a major data mining task aiming at finding different mechanisms responsible for different groups of objects in a data set. All existing approaches, however, are based on an assessment of distances (sometimes indirectly by assuming certain distributions) in the full-dimensional Euclidean data space. In high-dimensional data, these approaches are bound to deteriorate due to the notorious “curse of dimensionality”. In this paper, we propose a novel approach named ABOD (Angle-Based Outlier Detection) and some variants assessing the variance in the angles between the difference vectors of a point to the other points. This way, the effects of the “curse of dimensionality” are alleviated compared to purely distance-based approaches. A main advantage of our new approach is that our method does not rely on any parameter selection influencing the quality of the achieved ranking. In a thorough experimental evaluation, we compare ABOD to the well-established distance-based method LOF for various artificial data sets and a real-world data set and show ABOD to perform especially well on high-dimensional data.
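The core idea stated in this abstract, scoring a point by the variance of the angles between its difference vectors to all other points, can be sketched as follows. This is a simplified illustration, not the authors' ABOD: it uses the plain variance of unweighted cosines, whereas the abstract does not spell out the exact factor used in the paper. The names `angle_variance` and `rank_outliers` are hypothetical, and the points are assumed to be distinct.

```python
import itertools
import math
import statistics


def angle_variance(points, index):
    """Variance of cos(angle) over all pairs of difference vectors from points[index].

    Simplification (assumption): cosines are unweighted; points must be distinct
    and there must be at least three of them.
    """
    origin = points[index]
    diffs = [
        tuple(x - o for x, o in zip(p, origin))
        for j, p in enumerate(points)
        if j != index
    ]
    cosines = []
    for u, v in itertools.combinations(diffs, 2):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        cosines.append(dot / (norm_u * norm_v))
    return statistics.pvariance(cosines)


def rank_outliers(points):
    """Return point indices sorted from lowest to highest angle variance."""
    scores = [angle_variance(points, i) for i in range(len(points))]
    return sorted(range(len(points)), key=scores.__getitem__)


if __name__ == "__main__":
    # A 3x3 grid of inliers plus one distant point (index 9).
    grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
    data = grid + [(20.0, 20.0)]
    print(rank_outliers(data)[0])
```

A point far from the rest sees all other points in nearly the same direction, so its angle variance is small. A point inside a cluster sees neighbours in many directions, so its variance is large. The sketch therefore ranks low-variance points as the most outlying.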
Density connected clustering with local subspace preferences
 In ICDM ’04: Proceedings of the Fourth IEEE International Conference on Data Mining
, 2004
Abstract

Cited by 31 (9 self)
Many clustering algorithms tend to break down in high-dimensional feature spaces, because the clusters often exist only in specific subspaces (attribute subsets) of the original feature space. Therefore, the task of projected clustering (or subspace clustering) has been defined recently. As a novel solution to tackle this problem, we propose the concept of local subspace preferences, which captures the main directions of high point density. Using this concept we adopt density-based clustering to cope with high-dimensional data. In particular, we achieve the following advantages over existing approaches: Our proposed method has a determinate result, does not depend on the order of processing, is robust against noise, performs only one single scan over the database, and is linear in the number of dimensions. A broad experimental evaluation shows that our approach yields results of significantly better quality than recent work on clustering high-dimensional data.
Instance Selection Techniques for Memory-Based Collaborative Filtering
 in Proceedings of the 2nd SIAM International Conference on Data Mining
, 2002
Abstract

Cited by 29 (2 self)
Collaborative filtering (CF) has become an important data mining technique for making personalized recommendations for books, web pages, movies, etc. One popular algorithm is memory-based collaborative filtering, which predicts a user’s preference based on his or her similarity to other users (instances) in the database. However, with the tremendous growth in users and the large number of products, memory-based CF algorithms face the problem of deciding the right instances to use during prediction, in order to reduce execution cost and excessive storage, and possibly to improve the generalization accuracy by avoiding noise and overfitting. In this paper, we focus our work on a typical user preference database that contains many missing values, and propose four novel instance reduction techniques called TURF1-TURF4 as a preprocessing step to improve the efficiency and accuracy of the memory-based CF algorithm. The key idea is to generate predictions from a carefully selected set of relevant instances. We evaluate the techniques on the well-known EachMovie data set. Our experiments showed that the proposed algorithms not only dramatically sped up prediction but also improved accuracy.
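The memory-based prediction step described in this abstract can be sketched generically. The code below is a hedged illustration and does not implement the TURF1-TURF4 techniques, which the abstract does not specify. It assumes cosine similarity over co-rated items and strictly positive ratings. The names `cosine_similarity`, `predict`, and the `instances` parameter are hypothetical, with `instances` standing in for whatever reduced set of users an instance selection step would produce.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity over the items both users rated; 0.0 if none are shared.

    Assumption: ratings are strictly positive, so the norms are never zero.
    """
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item, instances=None):
    """Similarity-weighted average of other users' ratings for `item`.

    `instances` optionally restricts which users are consulted (a stand-in for
    a selected instance set). Returns None if no consulted user rated the item.
    """
    candidates = ratings if instances is None else instances
    weighted_sum = 0.0
    weight_total = 0.0
    for other in candidates:
        if other == user or item not in ratings[other]:
            continue
        sim = cosine_similarity(ratings[user], ratings[other])
        weighted_sum += sim * ratings[other][item]
        weight_total += sim
    if weight_total == 0.0:
        return None
    return weighted_sum / weight_total


if __name__ == "__main__":
    demo = {
        "alice": {"a": 5, "b": 1},
        "bob": {"a": 5, "b": 1, "c": 4},
        "carol": {"a": 1, "b": 5, "c": 2},
    }
    print(predict(demo, "alice", "c"))
    print(predict(demo, "alice", "c", instances=["bob"]))
```

Passing a smaller `instances` list reduces the number of similarity computations per prediction, which is the efficiency motivation the abstract gives for selecting instances before prediction.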