Parallel Spectral Clustering in Distributed Systems
Cited by 63 (1 self)
Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through
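The sparsification strategy named in the abstract above, retaining only each point's nearest neighbors in the similarity matrix, can be sketched as follows. This is a minimal illustration and not the paper's parallel implementation: the Gaussian similarity, the parameter names `t` and `sigma`, the dictionary storage, and the rule that keeps an entry if either point lists the other as a neighbor are all assumptions made for the example.

```python
import math


def gaussian_similarity(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two points (assumed similarity function)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def knn_sparse_similarity(points, t, sigma=1.0):
    """Return a symmetric sparse similarity matrix as a dict {(i, j): s_ij}.

    Only the t most similar other points are kept for each point. An entry
    (i, j) survives if j is among i's neighbors or i is among j's neighbors
    (an assumed symmetrization rule).
    """
    n = len(points)
    sparse = {}
    for i in range(n):
        sims = [
            (gaussian_similarity(points[i], points[j], sigma), j)
            for j in range(n)
            if j != i
        ]
        sims.sort(reverse=True)  # most similar first
        for s, j in sims[:t]:
            sparse[(i, j)] = s
            sparse[(j, i)] = s  # keep the matrix symmetric
    return sparse


# Two well-separated groups of three points each (hypothetical toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
S = knn_sparse_similarity(points, t=2)
print(len(S), "stored entries instead of", len(points) ** 2)
```

With `t` much smaller than the number of points, storage grows roughly linearly in the number of points rather than quadratically, which is the memory saving the abstract refers to.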
A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases
 In Proceedings of the First Workshop on Social Media Analytics, SOMA ’10
Cited by 61 (4 self)
Political scientists lack methods to efficiently measure the priorities political actors emphasize in statements. To address this limitation, I introduce a statistical model that attends to the structure of political rhetoric when measuring expressed priorities: statements are naturally organized by author. The expressed agenda model exploits this structure to simultaneously estimate the topics in the texts, as well as the attention political actors allocate to the estimated topics. I apply the method to a collection of over 64,000 press releases from senators from 2005-2007, which I demonstrate is an ideal medium to measure how senators explain their work in Washington to constituents. A set of examples validates the estimated priorities and demonstrates that the additional information included in the model provides better classification than expert human coders or statistical models for clustering that ignore the author of a document. The statistical model and its extensions will be made available in a forthcoming free software package for the R computing language and the press release data will be made available for download. ∗PhD Candidate, Harvard University Department of Government. I thank the Center for American Political Studies
Generative model-based document clustering: a comparative study
 Knowledge and Information Systems
, 2005
Cited by 50 (0 self)
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when the set of labels available in labeled data is not complete, i.e., unlabeled data contain new classes that are not present in labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when available labels are complete whereas the feedback-based approach excels when available labels are incomplete.
Combined key-frame extraction and object-based video segmentation
 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
, 2005
Cited by 22 (3 self)
Video segmentation has been an important and challenging issue for many video applications. Usually there are two different video segmentation approaches, i.e., shot-based segmentation that uses a set of key-frames to represent a video shot and object-based segmentation that partitions a video shot into objects and background. Representing a video shot at different semantic levels, the two segmentation processes are usually implemented separately or independently for video analysis. In this paper, we propose a new approach to combine the two video segmentation techniques. Specifically, a combined key-frame extraction and object-based segmentation method is developed based on state-of-the-art video segmentation algorithms and statistical clustering approaches. On the one hand, shot-based segmentation can dramatically facilitate and enhance object-based segmentation by using key-frame extraction to select a few key-frames for statistical model training. On the other hand, object-based segmentation can be used to improve shot-based segmentation results by using model-based key-frame refinement. The proposed approach is able to integrate the advantages of these two segmentation methods and provide a new combined shot-based and object-based framework for a variety of advanced video analysis tasks. Experimental results validate the effectiveness and flexibility of the proposed video segmentation algorithm.
MOSAIC: A proximity graph approach for agglomerative clustering
 IN: THE 9TH INTL. CONF. ON DATA WAREHOUSING AND KNOWLEDGE DISCOVERY
, 2007
Cited by 17 (12 self)
Representative-based clustering algorithms are quite popular due to their relatively high speed and because of their sound theoretical foundation. On the other hand, the clusters they can obtain are limited to convex shapes and clustering results are also highly sensitive to initializations. In this paper, a novel agglomerative clustering algorithm called MOSAIC is proposed which greedily merges neighboring clusters maximizing a given fitness function. MOSAIC uses Gabriel graphs to determine which clusters are neighboring and approximates nonconvex shapes as the unions of small clusters that have been computed using a representative-based clustering algorithm. The experimental results show that this technique leads to clusters of higher quality compared to running a representative-based clustering algorithm standalone. Given a suitable fitness function, MOSAIC is able to detect arbitrary shape clusters. In addition, MOSAIC is capable of dealing with high-dimensional data.
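The abstract above says MOSAIC uses Gabriel graphs to decide which clusters are neighbors. As a rough sketch of the standard Gabriel-graph definition (not the MOSAIC code itself): two points are connected if no third point lies inside the disc whose diameter is the segment between them. The function names, the brute-force triple loop, the toy representatives, and the convention that a point exactly on the disc boundary does not block an edge are assumptions for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def gabriel_edges(points):
    """Return the Gabriel-graph edges as a list of index pairs (i, j), i < j.

    A third point r lies strictly inside the disc with diameter pq exactly
    when d(p, r)^2 + d(q, r)^2 < d(p, q)^2, so an edge is kept only if no
    other point satisfies that inequality.
    """
    n = len(points)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = sq_dist(points[i], points[j])
            blocked = any(
                sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) < d_ij
                for k in range(n)
                if k != i and k != j
            )
            if not blocked:
                edges.append((i, j))
    return edges


# Hypothetical cluster representatives: the third one sits between the first two.
representatives = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1)]
edges = gabriel_edges(representatives)
print(edges)
```

In this toy case the first two representatives are not neighbors because the third lies inside the disc between them, so only clusters that are adjacent in this sense would be candidates for merging.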
Clustering processes
Cited by 17 (14 self)
The problem of clustering is considered, for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency, and show that simple consistent algorithms exist, under most general nonparametric assumptions. The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We show that, for the case of a known number of clusters, consistency can be achieved under the only assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, no assumptions of independence, neither between nor within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes. In both cases we give examples of simple (at most quadratic in each argument) algorithms which are consistent.
Online Clustering of Processes
Cited by 13 (12 self)
The problem of online clustering is considered in the case where each data point is a sequence generated by a stationary ergodic process. Data arrive in an online fashion so that the sample received at every timestep is either a continuation of some previously received sequence or a new sequence. The dependence between the sequences can be arbitrary. No parametric or independence assumptions are made; the only assumption is that the marginal distribution of each sequence is stationary and ergodic. A novel, computationally efficient algorithm is proposed and is shown to be asymptotically consistent (under a natural notion of consistency). The performance of the proposed algorithm is evaluated on simulated data, as well as on real datasets (motion classification).
Integrating recommendation models for improved web page prediction accuracy
 Thirty-First Australasian Computer Science Conference (ACSC’08)
, 2008
Cited by 12 (0 self)
Recent research initiatives have addressed the need for improved Web page prediction accuracy, which would benefit many applications, e-business in particular. Different Web usage mining frameworks have been implemented for this purpose, specifically association rules, clustering, and Markov models. Each of these frameworks has its own strengths and weaknesses, and it has been proved that using each of these frameworks individually does not provide a suitable solution that answers today’s Web page prediction needs. This paper endeavors to provide improved Web page prediction accuracy by using a novel approach that involves integrating clustering, association rules and Markov models according to some constraints. Experimental results prove that this integration provides better prediction accuracy than using each technique individually.
Learning using the Born rule
, 2006
Cited by 9 (0 self)
In Quantum Mechanics the transition from a deterministic description to a probabilistic one is done using a simple rule termed the Born rule. This rule states that the probability of an outcome ($a$) given a state ($\Psi$) is the square of their inner product ($(a^\top \Psi)^2$). In this paper, we will explore the use of the Born-rule-based probabilities for clustering, feature selection, classification, and for comparison between sets. We show how these probabilities lead to existing and new algebraic algorithms for which no other complete probabilistic justification is known.
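The Born rule as stated in the abstract above, the probability of an outcome being the squared inner product of the outcome and state vectors, can be checked numerically with a small sketch. The restriction to real-valued unit vectors, the function name `born_probability`, and the example state are assumptions for illustration and are not taken from the paper.

```python
def born_probability(outcome, state):
    """Squared inner product of two real-valued vectors (Born rule).

    Both vectors are assumed to have unit length; under that assumption the
    probabilities over an orthonormal set of outcomes sum to one.
    """
    inner = sum(a * s for a, s in zip(outcome, state))
    return inner ** 2


# Hypothetical unit-length state and an orthonormal basis of outcomes.
state = (3 / 5, 4 / 5)
basis = [(1.0, 0.0), (0.0, 1.0)]
probs = [born_probability(a, state) for a in basis]
print(probs, sum(probs))
```

Because the state has unit length and the outcomes form an orthonormal basis, the squared inner products behave like a probability distribution over the outcomes.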
Clustering Time series from Mixture Polynomial Models with Discretised Data
 Proceedings of 2nd Australian Data Mining Workshop
, 2003