Results 1–10 of 39
Graph mining: laws, generators, and algorithms
ACM Computing Surveys (CSUR), 2006
Cited by 132 (7 self)
How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M:N relation in database terminology can be represented as a graph. Many of these questions boil down to the following: “How can we generate synthetic but realistic graphs?” To answer this, we must first understand what patterns are common in real-world graphs and can thus be considered a mark of normality/realism. This survey gives an overview of the incredible variety of work that has been done on these problems. One of our main contributions is the integration of points of view from physics, mathematics, sociology, and computer science. Further, we briefly describe recent advances on some related and interesting graph problems.
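One classic answer to the survey's central question, “how can we generate synthetic but realistic graphs?”, is preferential attachment, which reproduces the heavy-tailed degree distributions seen in many real graphs. A minimal sketch (this generator is an illustrative assumption, not code from the survey):

```python
import random

def preferential_attachment(n, m, seed=0):
    """Grow a graph on n nodes; each new node attaches to m
    existing nodes chosen proportionally to their degree."""
    rng = random.Random(seed)
    # Start from a small clique on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # 'targets' repeats each endpoint once per incident edge, so a
    # uniform draw from it is exactly degree-proportional sampling.
    targets = [v for e in edges for v in e]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for t in chosen:
            edges.append((new, t))
            targets.extend((new, t))
    return edges

edges = preferential_attachment(200, 2)
```

A handful of early nodes end up with far more edges than the minimum of 2, which is the “rich get richer” signature of such generators.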
An Improved Approximation Algorithm for the Column Subset Selection Problem
Cited by 74 (13 self)
We consider the problem of selecting the “best” subset of exactly k columns from an m × n matrix A. In particular, we present and analyze a novel two-stage algorithm that runs in O(min{mn², m²n}) time and returns as output an m × k matrix C consisting of exactly k columns of A. In the first stage (the randomized stage), the algorithm randomly selects O(k log k) columns according to a judiciously chosen probability distribution that depends on information in the top-k right singular subspace of A. In the second stage (the deterministic stage), the algorithm applies a deterministic column-selection procedure to select and return exactly k columns from the set of columns selected in the first stage. Let C be the m × k matrix containing those k columns, let P_C denote the projection matrix onto the span of those columns, and let A_k denote the “best” rank-k approximation to the matrix A as computed with the singular value decomposition. Then, we prove that ‖A − P_C A‖₂ ≤ O(k^(3/4) log^(1/2)(k) (n − k)^(1/4)) ‖A − A_k‖₂.
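A minimal NumPy sketch of the two-stage idea described above: sampling probabilities come from the top-k right singular subspace, and a deterministic pass keeps exactly k of the sampled columns. Note the deterministic stage here is a simple greedy pivoted Gram–Schmidt stand-in, not the paper's rank-revealing QR, and the oversampling constant is an assumption:

```python
import numpy as np

def two_stage_css(A, k, c=None, seed=0):
    """Two-stage column subset selection (sketch).
    Stage 1: sample c ~ O(k log k) columns with probabilities
    proportional to leverage scores in the top-k right singular
    subspace. Stage 2: greedily keep exactly k of them."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    if c is None:
        c = min(n, int(np.ceil(4 * k * np.log(k + 1))) + k)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k]                        # top-k right singular vectors
    p = (Vk ** 2).sum(axis=0) / k      # leverage-score distribution
    cand = rng.choice(n, size=c, replace=False, p=p)
    # Deterministic stage (simplified): repeatedly pick the candidate
    # column with the largest residual norm, then deflate.
    R = A[:, cand].astype(float)
    chosen = []
    for _ in range(k):
        j = int(np.argmax((R ** 2).sum(axis=0)))
        chosen.append(int(cand[j]))
        q = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(q, q @ R)     # remove the chosen direction
    return A[:, chosen], chosen
```

The deflation step zeroes out the residual of each chosen column, so the greedy pass never selects the same column twice.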
Unsupervised Feature Selection for Principal Components Analysis [Extended Abstract]
Cited by 26 (8 self)
Principal Components Analysis (PCA) is the predominant linear dimensionality reduction technique, and has been widely applied on datasets in all scientific domains. We consider, both theoretically and empirically, the topic of unsupervised feature selection for PCA, by leveraging algorithms for the so-called Column Subset Selection Problem (CSSP). In words, the CSSP seeks the “best” subset of exactly k columns from an m × n data matrix A, and has been extensively studied in the Numerical Linear Algebra community. We present a novel two-stage algorithm for the CSSP. From a theoretical perspective, for small to moderate values of k, this algorithm significantly improves upon the best previously existing results [24, 12] for the CSSP. From an empirical perspective, we evaluate this algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis: finance, document-term data, and genetics. We pay particular attention to how this algorithm may be used to select representative or landmark features from an object-feature matrix in an unsupervised manner. In all three application domains, we are able to identify k landmark features, i.e., columns of the data matrix, that capture nearly the same amount of information as does the subspace that is spanned by the top k “eigenfeatures.”
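The claim that k landmark columns can capture nearly as much information as the top k eigenfeatures can be checked directly: compare the residual after projecting onto the selected columns with the optimal rank-k residual from the SVD. A sketch on hypothetical low-rank data (the data and the column choice here are illustrative assumptions):

```python
import numpy as np

def captured_information(A, cols, k):
    """Residual of projecting A onto k selected columns, versus the
    optimal rank-k ('eigenfeature') residual from the SVD."""
    Q, _ = np.linalg.qr(A[:, cols])          # orthonormal basis for the columns
    resid_cols = np.linalg.norm(A - Q @ (Q.T @ A), 'fro')
    s = np.linalg.svd(A, compute_uv=False)
    resid_pca = np.sqrt((s[k:] ** 2).sum())  # best possible rank-k residual
    return resid_cols, resid_pca

rng = np.random.default_rng(0)
# Hypothetical object-feature matrix: 100 objects, 10 features,
# approximately rank 3 plus a little noise.
A = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 10))
A += 0.01 * rng.standard_normal((100, 10))
r_cols, r_pca = captured_information(A, [0, 1, 2], 3)
```

Since the data are nearly rank 3, three generic columns span almost the same subspace as the top 3 singular vectors, so the two residuals end up close.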
Cluster Structure Inference Based on Clustering Stability with Applications to Microarray Data Analysis
EURASIP Journal on Applied Signal Processing, 2004:1, 64–80, 2004
Cited by 9 (0 self)
This paper focuses on the stability-based approach for estimating the number of clusters K in microarray data. The cluster stability approach amounts to performing clustering successively over random subsets of the available data and evaluating an index which expresses the similarity of the successive partitions obtained. We present a method for automatically estimating K starting from the distribution of the similarity index. We investigate how the choice of the hierarchical clustering (HC) method, and of the similarity index, influences the estimation accuracy. The paper introduces a new similarity index based on a partition distance. The performance of the new index and that of other well-known indices are experimentally evaluated by comparing the “true” data partition with the partition obtained at each level of an HC tree. A case study is conducted with a publicly available Leukemia dataset.
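The stability procedure described above can be sketched concretely: cluster random subsets of the data, compare the resulting partitions on the points the subsets share, and prefer the K with the most reproducible partitions. This sketch substitutes k-means for the paper's hierarchical clustering and uses the plain Rand index rather than the paper's new partition-distance index:

```python
import numpy as np

def kmeans(X, K, seed=0, iters=50):
    """Plain Lloyd's k-means, returning a label per point."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        lab = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(K):
            if (lab == j).any():
                centers[j] = X[lab == j].mean(0)
    return lab

def rand_index(a, b):
    """Fraction of point pairs on which two partitions agree."""
    n = len(a)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    agree = (same_a == same_b).sum() - n      # drop the diagonal
    return agree / (n * (n - 1))

def stability(X, K, reps=10, frac=0.8, seed=0):
    """Mean similarity of clusterings of random subsets, compared
    on the points the two subsets have in common."""
    rng = np.random.default_rng(seed)
    n, scores = len(X), []
    for r in range(reps):
        i1 = rng.choice(n, int(frac * n), replace=False)
        i2 = rng.choice(n, int(frac * n), replace=False)
        common = np.intersect1d(i1, i2)
        m1 = dict(zip(i1, kmeans(X[i1], K, seed=r)))
        m2 = dict(zip(i2, kmeans(X[i2], K, seed=r + 1000)))
        a = np.array([m1[i] for i in common])
        b = np.array([m2[i] for i in common])
        scores.append(rand_index(a, b))
    return float(np.mean(scores))
```

On data with two well-separated clusters, `stability(X, 2)` comes out near 1 while overly fine values of K score lower, which is the signal the estimator exploits.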
MRI Brain Image Segmentation by Fuzzy Symmetry Based Genetic Clustering Technique
IEEE Congress on Evolutionary Computation, 2007
Cited by 5 (0 self)
In this paper, an automatic segmentation technique for multispectral magnetic resonance images of the brain, using a new fuzzy point-symmetry-based genetic clustering technique, is proposed. The proposed real-coded variable-string-length genetic fuzzy clustering technique (Fuzzy-VGAPS) is able to evolve the number of clusters present in the data set automatically. Here, assignment of points to different clusters is made based on the point-symmetry-based distance rather than the Euclidean distance. The cluster centers are encoded in the chromosomes, whose lengths may vary. A newly developed fuzzy point-symmetry-based cluster validity index, FSym-index, is used as a measure of ‘goodness’ of the corresponding partition. This validity index is able to correctly indicate the presence of clusters of different sizes as long as they are internally symmetrical. A Kd-tree-based data structure is used to reduce the complexity of computing the symmetry distance. The proposed method is applied on several simulated T1-weighted, T2-weighted and proton-density normal and MS-lesion magnetic resonance brain images. Superiority of the proposed method over Fuzzy …
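The point-symmetry-based distance the abstract refers to rewards points whose reflection through a candidate centre lands near other data points. A brute-force sketch (the paper accelerates the nearest-neighbour search with a Kd-tree; the two-neighbour averaging below is the usual formulation but is stated here as an assumption):

```python
import numpy as np

def point_symmetry_distance(x, c, X, knear=2):
    """Point-symmetry distance of point x w.r.t. candidate centre c:
    reflect x through c, average the distances from the reflected
    point to its knear nearest data points, and scale by the
    Euclidean distance between x and c."""
    reflected = 2.0 * c - x
    d = np.linalg.norm(X - reflected, axis=1)
    sym = np.sort(d)[:knear].mean()
    return sym * np.linalg.norm(x - c)
```

For a cluster that is symmetric about its true centre, the reflected point nearly coincides with another member, so the distance stays small; for a wrong centre it blows up.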
Data Size Reduction for Clustering-based Binning of ICs Using Principal Component Analysis
 Proc. IEEE Int. Workshop on Current and Defect Based Testing, 2005
Cited by 4 (4 self)
Accurate binning of ICs using analog characteristics such as IDDQ requires using data from a number of vectors. From this data, information needs to be extracted using a method that will yield sufficiently high resolution. Using a large volume of data can require significant computation time. If n analog measurements are made for each chip, the data has n dimensions. However, the measured IDDQ values for a chip can be highly correlated. We examine an approach based on Principal Component Analysis (PCA) for reducing the data size while preserving almost all of the information. PCA transforms the data by extracting uncorrelated components and arranging them in order of relative significance. Using industrial IDDQ data, we found that the n-dimensional data can often be reduced to a single dimension with no substantial change in the clusters identified.
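The reduction the abstract describes is standard PCA: when the n measurements per chip all track one underlying factor, the first principal component carries almost all of the variance. A sketch on synthetic stand-in data (the correlated-leakage model below is an assumption, not the paper's industrial data):

```python
import numpy as np

def pca_reduce(X, dims=1):
    """Project mean-centred data onto its leading principal
    components; also report the fraction of variance kept."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s[:dims] ** 2).sum() / (s ** 2).sum()
    return Xc @ Vt[:dims].T, float(explained)

rng = np.random.default_rng(0)
# Hypothetical IDDQ data: 5 measurements per chip, all scaling
# with one latent leakage level, plus small measurement noise.
leakage = rng.standard_normal(100)
X = np.outer(leakage, np.linspace(1.0, 2.0, 5))
X += 0.01 * rng.standard_normal(X.shape)
Z, frac = pca_reduce(X, dims=1)
```

Here `frac` is essentially 1, i.e. a single dimension preserves the cluster structure, which mirrors the paper's observation for correlated IDDQ vectors.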
Clustering analysis of microarray gene expression data by splitting algorithm (Mjolsness, E.)
J. Parallel Distrib. Comput.
Cited by 2 (0 self)
A clustering method based on recursive bisection is introduced for analyzing microarray gene expression data. Either or both dimensions, the genes and the samples of a given microarray dataset, can be classified in an unsupervised fashion. Alternatively, if certain prior knowledge of the genes or samples is available, a supervised version of the clustering analysis can also be carried out. Either approach may be used to generate a partial or complete binary hierarchy, the dendrogram, showing the underlying structure of the dataset. Compared to other existing clustering methods used for microarray data analysis (such as the hierarchical, K-means, and self-organizing map methods), the method presented here has the advantage of much improved computational efficiency, while retaining effective separation of data clusters under a distance metric, a straightforward parallel implementation, and useful extraction and presentation of biological information. Clustering results on both synthesized and experimental microarray data are presented to demonstrate the performance of the algorithm.
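Recursive bisection builds the binary hierarchy top-down: split the data in two, then recurse on each half. A sketch using 2-means for each split (the 2-means splitter and the stopping rules are stand-ins; the paper's exact splitting criterion is not reproduced here):

```python
import numpy as np

def bisect(X, idx=None, min_size=5, depth=0, max_depth=3):
    """Recursively split the data with 2-means, returning a nested
    binary structure of index lists, i.e. a simple dendrogram."""
    if idx is None:
        idx = np.arange(len(X))
    if len(idx) <= min_size or depth >= max_depth:
        return idx.tolist()
    pts = X[idx]
    # Initialise the two centres far apart: the point farthest from
    # the mean, and the point farthest from that point.
    far = int(np.argmax(np.linalg.norm(pts - pts.mean(0), axis=1)))
    opp = int(np.argmax(np.linalg.norm(pts - pts[far], axis=1)))
    c = np.vstack([pts[far], pts[opp]])
    for _ in range(25):                      # Lloyd iterations
        lab = np.linalg.norm(pts[:, None] - c[None], axis=2).argmin(1)
        for j in (0, 1):
            if (lab == j).any():
                c[j] = pts[lab == j].mean(0)
    if (lab == 0).all() or (lab == 1).all():
        return idx.tolist()                  # no useful split found
    return [bisect(X, idx[lab == 0], min_size, depth + 1, max_depth),
            bisect(X, idx[lab == 1], min_size, depth + 1, max_depth)]
```

Each internal node costs only one 2-means run on its subset, which is where the computational-efficiency advantage over full hierarchical clustering comes from.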
Sensor-Embedded Teeth for Oral Activity Recognition
Cited by 1 (0 self)
This paper presents the design and implementation of a wearable oral sensory system that recognizes human oral activities, such as chewing, drinking, speaking, and coughing. We conducted an evaluation of this oral sensory system in a laboratory experiment involving 8 participants. The results show 93.8% oral activity recognition accuracy when using a person-dependent classifier and 59.8% accuracy when using a person-independent classifier.
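The gap between the person-dependent and person-independent figures comes from the evaluation protocol: person-independent accuracy is measured by holding out each participant entirely (leave-one-person-out). A sketch of that protocol with a nearest-centroid stand-in classifier (the paper's actual classifier and features are not specified here):

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Per-class mean vectors as a minimal classifier."""
    classes = np.unique(y)
    return classes, np.vstack([X[y == c].mean(0) for c in classes])

def nearest_centroid_predict(model, X):
    classes, cent = model
    d = np.linalg.norm(X[:, None] - cent[None], axis=2)
    return classes[d.argmin(1)]

def person_independent_accuracy(X, y, person):
    """Leave-one-person-out: train on everyone else, test on the
    held-out participant, average accuracy over participants."""
    accs = []
    for p in np.unique(person):
        tr, te = person != p, person == p
        model = nearest_centroid_fit(X[tr], y[tr])
        accs.append(float((nearest_centroid_predict(model, X[te]) == y[te]).mean()))
    return float(np.mean(accs))
```

A person-dependent score would instead train and test within each participant's own data, which is why it is typically much higher.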
A New Multiobjective Simulated Annealing Based Clustering Technique Using Stability and Symmetry
In: 19th International Conference on Pattern Recognition, 2008
Cited by 1 (0 self)
Most clustering algorithms operate by optimizing (either implicitly or explicitly) a single measure of cluster solution quality. Such methods may perform well on some data sets but lack robustness with respect to variations in cluster shape, proximity, evenness, and so forth. In this paper, we propose a multiobjective clustering technique which simultaneously optimizes two objectives, one reflecting the total symmetry present in the data set and the other reflecting the stability of the obtained partitions over different bootstrap samples of the data set. The proposed algorithm utilizes a recently developed simulated annealing based multiobjective optimization technique, AMOSA, as the underlying optimization method. Here, assignment of points to different clusters is done based on the point-symmetry-based distance rather than the Euclidean distance. Results on several artificial and real-life data sets show that the proposed technique is well-suited to detect the number of clusters in data sets having point-symmetric clusters.
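Optimizing symmetry and stability simultaneously means no single partition is “best”; multiobjective methods such as AMOSA keep an archive of mutually non-dominated solutions instead. A sketch of just the Pareto-dominance test such an archive relies on (AMOSA's annealing and archive-maintenance machinery is not reproduced; the scores below are hypothetical):

```python
def pareto_front(solutions):
    """Keep the non-dominated solutions, with every objective to be
    maximised (here: a symmetry score and a stability score)."""
    front = []
    for a in solutions:
        dominated = any(
            all(b[i] >= a[i] for i in range(len(a))) and
            any(b[i] > a[i] for i in range(len(a)))
            for b in solutions)
        if not dominated:
            front.append(a)
    return front

# Hypothetical (symmetry, stability) scores for candidate partitions:
cands = [(0.9, 0.4), (0.7, 0.8), (0.6, 0.6), (0.95, 0.35)]
front = pareto_front(cands)
```

The partition scoring (0.6, 0.6) is dropped because (0.7, 0.8) beats it on both objectives; the remaining candidates each trade one objective against the other.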