Results 11 - 20 of 30
Agglomerative Mean-Shift Clustering via Query Set Compression
"... Mean-Shift (MS) is a powerful non-parametric clustering method. Although good accuracy can be achieved, its computational cost is particularly expensive even on moderate data sets. In this paper, for the purpose of algorithm speedup, we develop an agglomerative MS clustering method called Agglo-MS, ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Mean-Shift (MS) is a powerful non-parametric clustering method. Although it can achieve good accuracy, it is computationally expensive even on moderately sized data sets. In this paper, for the purpose of algorithm speedup, we develop an agglomerative MS clustering method called Agglo-MS, together with an analysis of its mode-seeking ability and convergence properties. Our method is built upon an iterative query set compression mechanism which is motivated by the quadratic bounding optimization nature of MS. The whole framework can be efficiently implemented in linear running time. Furthermore, we show that pairwise constraint information can be naturally integrated into our framework to derive a semi-supervised non-parametric clustering method. Extensive experiments on toy and real-world data sets validate the speedup and numerical accuracy of our method, as well as the superiority of its semi-supervised version.
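For context, the baseline being accelerated is the standard Mean-Shift mode-seeking iteration, a minimal sketch of which is given below with a Gaussian kernel (illustrative only; the function and parameter names are not from the paper, and this is not the Agglo-MS query-set-compression algorithm itself):

import numpy as np

def mean_shift_modes(X, bandwidth=1.0, n_iter=50, tol=1e-5):
    # Move every point toward a local density mode via the Mean-Shift update.
    # X: (n, d) data matrix. Points whose converged positions (nearly)
    # coincide share a mode, i.e. a cluster.
    modes = X.copy()
    for _ in range(n_iter):
        shifted = np.empty_like(modes)
        for i, y in enumerate(modes):
            # Gaussian kernel weights of all data points w.r.t. the current estimate y.
            w = np.exp(-np.sum((X - y) ** 2, axis=1) / (2.0 * bandwidth ** 2))
            # The Mean-Shift update: a kernel-weighted mean of the data.
            shifted[i] = w @ X / w.sum()
        converged = np.max(np.linalg.norm(shifted - modes, axis=1)) < tol
        modes = shifted
        if converged:
            break
    return modes

The inner loop touches every data point for every query point on every iteration, which is the quadratic cost that motivates compressing the query set in the first place.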
Optimizing Temporal Topic Segmentation for Intelligent Text Visualization
"... Figure 1. TIARA-generated visual text summary showing four topic layers. We are building a topic-based, interactive visual analytic tool that aids users in analyzing large collections of text. To help users quickly discover content evolution and significant content transitions within a topic over ti ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Figure 1. TIARA-generated visual text summary showing four topic layers.
We are building a topic-based, interactive visual analytic tool that aids users in analyzing large collections of text. To help users quickly discover content evolution and significant content transitions within a topic over time, we present a novel, constraint-based approach to temporal topic segmentation. Our solution splits a discovered topic into multiple linear, non-overlapping sub-topics along a timeline by simultaneously satisfying a diverse set of semantic, temporal, and visualization constraints. For each derived sub-topic, our solution also automatically selects a set of representative keywords to summarize its main content. Our extensive evaluation, including a crowdsourced user study, demonstrates the effectiveness of our method over an existing baseline.
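The contiguous, non-overlapping splitting along a timeline can be pictured with a generic dynamic-programming segmenter (a rough stand-in only; the paper's constraint-satisfaction formulation, constraint set, and keyword selection are not reproduced here, and all names below are made up):

import numpy as np

def segment_timeline(cost, k):
    # Split time steps 0..n-1 into k contiguous segments minimizing total cost.
    # cost[i][j] is a precomputed cost of grouping steps i..j into one
    # sub-topic (e.g. the spread of their word distributions around the
    # segment centroid). Returns the sorted segment start indices (excluding 0).
    n = len(cost)
    INF = float("inf")
    best = np.full((n, k), INF)        # best[t, m]: cost of covering 0..t with m+1 segments
    back = np.full((n, k), -1, dtype=int)
    for t in range(n):
        best[t, 0] = cost[0][t]
    for m in range(1, k):
        for t in range(n):
            for s in range(m, t + 1):  # the last segment is s..t
                c = best[s - 1, m - 1] + cost[s][t]
                if c < best[t, m]:
                    best[t, m] = c
                    back[t, m] = s
    # Walk back through the table to recover the segment boundaries.
    bounds, t, m = [], n - 1, k - 1
    while m > 0:
        s = back[t, m]
        bounds.append(s)
        t, m = s - 1, m - 1
    return sorted(bounds)

In this sketch all semantic, temporal, and visualization considerations are assumed to be folded into the precomputed cost table; the paper instead treats them as explicit constraints to be satisfied simultaneously.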
SHIFTR: A Fast and Scalable System for Ad Hoc Sensemaking of Large Graphs
"... We present SHIFTR, a system that assists users in making sense of large scale graph data. Making sense of information represented as large graphs is a fundamental challenge in many data-intensive domains. We suggest the potential of strong synergies between the data mining, cognitive psychology, and ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
We present SHIFTR, a system that assists users in making sense of large-scale graph data. Making sense of information represented as large graphs is a fundamental challenge in many data-intensive domains. We suggest the potential of strong synergies between the data mining, cognitive psychology, and HCI communities in matching powerful graph mining tools with insights into how people learn and interact with information, and here we present SHIFTR as one such application. SHIFTR adapts the Belief Propagation algorithm to target important sensemaking tasks such as flexibly reorganizing graph entities into multiple groups based on both positive and negative examples. SHIFTR scales linearly with the graph size through its fast algorithm, novel mList data structure, and externalization of graph metadata. We demonstrate SHIFTR's usage and benefits through real-world sensemaking scenarios using the DBLP dataset, which has almost 2 million author-publication relationships. A demo video of SHIFTR can be downloaded at
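The example-driven grouping can be pictured with a simple score-propagation sketch (illustrative only: this is a plain label-propagation stand-in, not SHIFTR's adapted Belief Propagation or its mList structure, and every name and parameter below is made up):

import numpy as np

def propagate_group_scores(adj, seeds, n_groups, alpha=0.85, n_iter=30):
    # Spread group-membership scores from example nodes over a graph.
    # adj: dense (n, n) adjacency matrix; seeds: dict node index -> group id.
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    W = adj / np.maximum(deg, 1e-12)           # row-normalized transition matrix
    prior = np.full((n, n_groups), 1.0 / n_groups)
    for node, group in seeds.items():
        prior[node] = 0.0
        prior[node, group] = 1.0
    scores = prior.copy()
    for _ in range(n_iter):
        # Mix neighbor evidence with the seed prior (standard propagation step).
        scores = alpha * (W @ scores) + (1.0 - alpha) * prior
    return scores.argmax(axis=1)               # hard group assignment per node

A negative example for a group can be encoded by seeding that node into a competing "everything else" group, which is roughly the role positive and negative examples play in the grouping task described above.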
Language Modelling of Constraints for Text Clustering
"... Abstract. Constrained clustering is a recently presented family of semi-supervised learning algorithms. These methods use domain information to impose constraints over the clustering output. The way in which those constraints (typically pair-wise constraints between documents) are in-troduced is by ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Constrained clustering is a recently presented family of semi-supervised learning algorithms. These methods use domain information to impose constraints on the clustering output. Such constraints (typically pair-wise constraints between documents) are usually introduced by designing new clustering algorithms that enforce them. In this paper we present an alternative approach to constrained clustering where, instead of defining new algorithms or objective functions, the constraints are introduced by modifying the document representation by means of language modelling. More precisely, the constraints are modelled using the well-known Relevance Models, successfully applied in other retrieval tasks such as pseudo-relevance feedback. To the best of our knowledge this is the first attempt of this kind. The results show that the presented approach is an effective method for constrained clustering, even improving on the results of existing constrained clustering algorithms.
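The core idea of pushing constraints into the representation rather than into the algorithm can be sketched as follows (a simplification: the sketch merely smooths each document's term distribution with those of its must-linked partners, whereas the paper builds the new representation from Relevance Models; all names are illustrative):

from collections import Counter

def constrained_representation(doc_terms, must_links, lam=0.7):
    # doc_terms: dict doc_id -> list of tokens; must_links: list of (doc_a, doc_b) pairs.
    # Each document's unigram distribution is mixed with the average distribution
    # of its must-linked partners, pulling constrained documents closer together
    # before any off-the-shelf clusterer is run on the representations.
    def dist(tokens):
        counts = Counter(tokens)
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}

    base = {d: dist(toks) for d, toks in doc_terms.items()}
    partners = {d: [] for d in doc_terms}
    for a, b in must_links:
        partners[a].append(b)
        partners[b].append(a)

    smoothed = {}
    for d, p in base.items():
        if not partners[d]:
            smoothed[d] = dict(p)
            continue
        mixed = Counter()
        for q in partners[d]:
            for t, v in base[q].items():
                mixed[t] += v / len(partners[d])
        vocab = set(p) | set(mixed)
        smoothed[d] = {t: lam * p.get(t, 0.0) + (1 - lam) * mixed.get(t, 0.0)
                       for t in vocab}
    return smoothed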
Multiplicative Update Rules for Nonnegative Matrix Factorization with Co-occurrence Constraints
"... Nonnegative matrix factorization (NMF) is a widely-used tool for obtaining low-rank approximations of nonnegative data such as digital images, audio signals, textual data, financial data, and more. One disadvantage of the basic NMF formulation is its inability to control the amount of dependence amo ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Nonnegative matrix factorization (NMF) is a widely-used tool for obtaining low-rank approximations of nonnegative data such as digital images, audio signals, textual data, financial data, and more. One disadvantage of the basic NMF formulation is its inability to control the amount of dependence among the learned dictionary atoms. Enforcing dependence within predetermined groups of atoms allows objects to be represented using multiple atoms instead of only one atom. In this paper, we introduce three simple and convenient multiplicative update rules for NMF that enforce dependence among atoms. Using examples in music transcription, we demonstrate the ability of these updates to represent each musical note with multiple atoms and cluster the atoms for source separation purposes. Index Terms: Dictionary learning, sparse coding, music transcription, source separation.
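For reference, the unconstrained baseline that such work builds on is the standard pair of Lee-Seung multiplicative updates for the Frobenius-norm NMF objective, sketched below (this is not the paper's co-occurrence-constrained variant; names are illustrative):

import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-9):
    # Basic multiplicative updates for V ~= W @ H under the Frobenius loss.
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update rescales the current factor by a ratio of nonnegative
        # terms, so W and H stay elementwise nonnegative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

The constrained rules described in the abstract modify these updates so that atoms within a predetermined group are encouraged to depend on one another; the multiplicative form, and hence nonnegativity, is preserved.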
The Advantage of Careful Imputation Sources in Sparse Data-Environment of Recommender Systems: Generating Improved SVD-based Recommendations
, 2013
"... Recommender systems apply machine learning and data mining techniques for filtering unseen information and can predict whether a user would like a given item. The main types of recommender systems namely collaborative filtering and content-based filtering suffer from scalability, data sparsity, and ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Recommender systems apply machine learning and data mining techniques to filter unseen information and can predict whether a user would like a given item. The main types of recommender systems, namely collaborative filtering and content-based filtering, suffer from scalability, data sparsity, and cold-start problems, resulting in poor-quality recommendations and reduced coverage. There has been some work in the literature on increasing scalability by reducing the dimensions of the recommender system dataset using singular value decomposition (SVD); however, due to sparsity this results in inaccurate recommendations. In this paper, we show how a careful selection of the imputation source in an SVD-based recommender system can provide benefits ranging from cost saving to performance enhancement. The proposed missing-value imputation methods can exploit underlying data correlation structures and exhibit much better accuracy and performance than the traditional missing-value imputation strategy (the item average of the user-item rating matrix) that has been the preferred approach in the literature. Through extensive experiments on three different datasets, we show that the proposed approaches outperform the traditional one and, moreover, provide better recommendations under the new-user cold-start problem, the new-item cold-start problem, the long-tail problem, and sparse conditions.
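The basic pipeline that the imputation source plugs into can be sketched as follows (a minimal illustration, assuming item-average imputation as the traditional baseline the abstract contrasts against; names and the rank parameter are made up):

import numpy as np

def svd_recommend(R, k=10):
    # R: (users x items) rating matrix with np.nan for unknown ratings.
    # Missing entries are filled with the item average, then a truncated SVD
    # of the imputed matrix provides a low-rank reconstruction whose entries
    # serve as rating predictions.
    filled = R.copy()
    item_mean = np.nanmean(R, axis=0)
    for j in range(R.shape[1]):
        mask = np.isnan(filled[:, j])
        filled[mask, j] = item_mean[j]
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

Swapping the item-average fill for a more informative imputation source (the focus of the paper) changes only the imputation step; the truncated SVD and prediction steps stay the same.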
Non-negative Matrix Factorizations for Clustering: A Survey
"... Recently there has been significant development in the use of non-negative matrix factorization (NMF) methods for various clustering tasks. NMF factorizes an input nonnegative matrix into two nonnegative matrices of lower rank. Although NMF can be used for conventional data analysis, the recent over ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Recently there has been significant development in the use of non-negative matrix factorization (NMF) methods for various clustering tasks. NMF factorizes an input nonnegative matrix into two nonnegative matrices of lower rank. Although NMF can be used for conventional data analysis, the recent overwhelming interest in NMF is due to its newly discovered ability to solve challenging data mining and machine learning problems. In particular, NMF with the sum-of-squared-error cost function is equivalent to a relaxed K-means clustering, the most widely used unsupervised learning algorithm. In addition, NMF with the I-divergence cost function is equivalent to probabilistic latent semantic indexing, another unsupervised learning method popularly used in text analysis. Many other data mining and machine learning problems can be reformulated as an NMF problem. This chapter aims to provide a comprehensive review of non-negative matrix factorization methods for clustering. In particular, we outline the theoretical foundations of NMF for clustering, provide an overview of different variants of NMF formulations, and examine …
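The relaxed-K-means equivalence mentioned above can be sketched in one line (the notation below is ours, not the chapter's): with the data vectors as the columns of X, the centroids as the columns of C, and a binary cluster-indicator matrix H with H_{ik} = 1 iff x_i belongs to cluster k,

    J_{K\text{-means}} \;=\; \sum_{k} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2 \;=\; \lVert X - C H^{\top} \rVert_F^2 .

Relaxing H from a binary indicator to a general nonnegative matrix (together with a suitable normalization or orthogonality condition on H to keep the correspondence tight) turns the right-hand side into the NMF objective \min_{C \ge 0,\, H \ge 0} \lVert X - C H^{\top} \rVert_F^2, which is the sense in which NMF with the squared-error cost relaxes K-means.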
CLUSTERING
"... Abstract—In this paper, we propose a novel constrained coclustering method to achieve two goals. First, we combine informationtheoretic coclustering and constrained clustering to improve clustering performance. Second, we adopt both supervised and unsupervised constraints to demonstrate the effectiv ..."
Abstract
- Add to MetaCart
(Show Context)
In this paper, we propose a novel constrained coclustering method to achieve two goals. First, we combine information-theoretic coclustering and constrained clustering to improve clustering performance. Second, we adopt both supervised and unsupervised constraints to demonstrate the effectiveness of our algorithm. The unsupervised constraints are automatically derived from existing knowledge sources, thus saving the effort and cost of using manually labeled constraints. To achieve our first goal, we develop a two-sided hidden Markov random field (HMRF) model to represent both document and word constraints. We then use an alternating expectation maximization (EM) algorithm to optimize the model. We also propose two novel methods to automatically construct and incorporate document and word constraints to support unsupervised constrained clustering: 1) automatically construct document constraints based on overlapping named entities (NE) extracted by an NE extractor; 2) automatically construct word constraints based on their semantic distance inferred from WordNet. The results of our evaluation over two benchmark data sets demonstrate the superiority of our approaches against a number of existing approaches. Index Terms: Constrained clustering, coclustering, unsupervised constraints, text clustering
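Schematically (our notation and a simplification, not the paper's exact HMRF formulation), such a two-sided constrained coclustering trades the information-theoretic coclustering objective off against penalties for violated constraints on documents and on words:

    \min_{\hat{X}, \hat{Y}} \; \bigl[ I(X;Y) - I(\hat{X};\hat{Y}) \bigr]
    \;+\; \sum_{(x_i, x_j) \in \mathcal{M}_d} w_{ij}\, \mathbf{1}\bigl[\hat{x}(x_i) \ne \hat{x}(x_j)\bigr]
    \;+\; \sum_{(y_u, y_v) \in \mathcal{M}_w} w_{uv}\, \mathbf{1}\bigl[\hat{y}(y_u) \ne \hat{y}(y_v)\bigr],

where I(\cdot;\cdot) is mutual information, \hat{X} and \hat{Y} are the document and word cluster assignments, and \mathcal{M}_d and \mathcal{M}_w are must-link sets built from named-entity overlap and WordNet distance respectively (cannot-link penalties are analogous). An alternating scheme then updates the document-side assignment with the word side fixed, and vice versa.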
Collective Matrix Factorization of Predictors, Neighborhood and Targets for Semi-Supervised Classification
"... Abstract. Due to the small size of available labeled data for semi-supervised learning, approaches to this problem make strong assump-tions about the data, performing well only when such assumptions hold true. However, a lot of effort may have to be spent in understanding the data so that the most s ..."
Abstract
- Add to MetaCart
(Show Context)
Due to the small size of available labeled data for semi-supervised learning, approaches to this problem make strong assumptions about the data, performing well only when such assumptions hold true. However, a lot of effort may have to be spent in understanding the data so that the most suitable model can be applied. This process can be as critical as gathering labeled data. One way to overcome this hindrance is to control the contribution of different assumptions to the model, rendering it capable of performing reasonably in a wide range of applications. In this paper we propose a collective matrix factorization model that simultaneously decomposes the predictor, neighborhood and target matrices (PNT-CMF) to achieve semi-supervised classification. By controlling how strongly the model relies on different assumptions, PNT-CMF is able to perform well on a wider variety of datasets. Experiments on synthetic and real-world datasets show that, while state-of-the-art models (TSVM and LapSVM) excel on datasets that match their characteristics and have a performance drop on the others, our approach outperforms them by being consistently competitive across different situations.
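An illustrative joint-factorization sketch of this kind of model is given below (a rough approximation under our own assumptions, not the paper's exact PNT-CMF objective or optimizer; every name, loss weight, and learning rate is made up):

import numpy as np

def collective_mf(X, S, Y, mask, rank=5, lr=0.01, n_iter=500, a=1.0, b=1.0, c=1.0):
    # Jointly factorize predictors X (n x d), neighborhood S (n x n), and
    # partially observed targets Y (n x t) with a shared instance factor U.
    # mask marks the labeled entries of Y. Plain gradient descent on
    #   a*||X - U Vx^T||^2 + b*||S - U U^T||^2 + c*||mask*(Y - U Vy^T)||^2.
    n, d = X.shape
    t = Y.shape[1]
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n, rank))
    Vx = rng.normal(scale=0.1, size=(d, rank))
    Vy = rng.normal(scale=0.1, size=(t, rank))
    for _ in range(n_iter):
        Ex = X - U @ Vx.T                      # predictor reconstruction error
        Es = S - U @ U.T                       # neighborhood reconstruction error
        Ey = mask * (Y - U @ Vy.T)             # error on labeled targets only
        gU = -2 * (a * Ex @ Vx + b * (Es + Es.T) @ U + c * Ey @ Vy)
        gVx = -2 * a * Ex.T @ U
        gVy = -2 * c * Ey.T @ U
        U -= lr * gU
        Vx -= lr * gVx
        Vy -= lr * gVy
    return U @ Vy.T                            # scores for every instance/target pair

Weighting the three loss terms differently is one simple way to control how strongly the model relies on the predictor, neighborhood, and target assumptions, which is the kind of knob the abstract describes.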
Little Is Much: Bridging Cross-Platform Behaviors through Overlapped Crowds
"... People often use multiple platforms to fulfill their dif-ferent information needs. With the ultimate goal of serv-ing people intelligently, a fundamental way is to get comprehensive understanding about user needs. How to organically integrate and bridge cross-platform in-formation in a human-centric ..."
Abstract
- Add to MetaCart
People often use multiple platforms to fulfill their different information needs. With the ultimate goal of serving people intelligently, a fundamental requirement is a comprehensive understanding of user needs, so how to organically integrate and bridge cross-platform information in a human-centric way is an important question. Existing transfer learning methods assume that the user populations of the platforms are either fully overlapped or non-overlapped. In reality, however, the users of different platforms are partially overlapped. The number of overlapped users is often small, and the number of explicitly known overlapped users is even smaller because users lack a unified ID across different platforms. In this paper, we propose a novel semi-supervised transfer learning method, called XPTRANS, to address the problem of cross-platform behavior prediction. To alleviate the sparsity issue, it fully exploits the small number of overlapped users to optimally bridge a user's behaviors across different platforms. Extensive experiments across two real social networks show that XPTRANS significantly outperforms the state-of-the-art. We demonstrate that by fully exploiting 26% overlapped users, XPTRANS can predict the behaviors of non-overlapped users with the same accuracy as overlapped users, which means that a small overlapped crowd can successfully bridge information across different platforms.