## A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation (2004)

Venue: | KDD |

Citations: | 135 (29 self) |

### Citations

5407 | Convex Analysis
- ROCKAFELLAR
- 1970
Citation Context: ...1967; Censor and Zenios, 1998), which form a large class of well-behaved loss functions with a number of desirable properties. Definition 1 Let φ be a real-valued convex function of Legendre type (Rockafellar, 1970; Banerjee et al., 2005b) defined on the convex set S ≡ dom(φ) (⊆ R^d). The Bregman divergence d_φ : S × ri(S) ↦ R_+ is defined as d_φ(z1, z2) = φ(z1) − φ(z2) − ⟨z1 − z2, ∇φ(z2)⟩, where ∇φ is the gradient of φ. |
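The Bregman divergence cited in this context can be checked numerically. A minimal sketch (function names and test vectors are illustrative, not from the paper), showing that squared Euclidean distance and KL-divergence arise from particular choices of φ:

```python
import numpy as np

def bregman_divergence(phi, grad_phi, z1, z2):
    """d_phi(z1, z2) = phi(z1) - phi(z2) - <z1 - z2, grad phi(z2)>."""
    return phi(z1) - phi(z2) - np.dot(z1 - z2, grad_phi(z2))

# phi(z) = ||z||^2 recovers squared Euclidean distance
sq = lambda z: np.dot(z, z)
grad_sq = lambda z: 2.0 * z

# phi(z) = sum_i z_i log z_i (negative entropy) recovers KL-divergence
# when z1 and z2 are probability vectors
negent = lambda z: np.sum(z * np.log(z))
grad_negent = lambda z: np.log(z) + 1.0

z1 = np.array([0.2, 0.3, 0.5])
z2 = np.array([0.4, 0.4, 0.2])

d_euc = bregman_divergence(sq, grad_sq, z1, z2)         # equals ||z1 - z2||^2
d_kl = bregman_divergence(negent, grad_negent, z1, z2)  # equals KL(z1 || z2)
```

Both values are non-negative, as required of a Bregman divergence, and each matches its closed-form counterpart.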

2797 | Algorithms for clustering data - JAIN, DUBES - 1988 |

1546 | Grouplens: An open architecture for collaborative filtering of netnews
- Resnick, Iacovou, et al.
- 1994
Citation Context: ...nd 20% of the ratings as the test data in each run. Table 6 shows the mean absolute error (MAE) obtained using various existing collaborative filtering approaches (Sarwar et al., 2000; Hofmann, 2004; Resnick et al., 1994) as well as the co-clustering approach based on squared Euclidean distance. From the table, we note that the co-clustering method based on C5 provides accuracy comparable to that of the SVD and NNMF-ba... |
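The MAE metric used in this evaluation is simply the mean of the absolute prediction errors over the held-out ratings. A minimal sketch with toy ratings (the numbers are illustrative, not from the paper's experiments):

```python
import numpy as np

def mean_absolute_error(true_ratings, predicted_ratings):
    """MAE over held-out test ratings: mean of |r - r_hat|."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    predicted_ratings = np.asarray(predicted_ratings, dtype=float)
    return np.mean(np.abs(true_ratings - predicted_ratings))

# toy held-out ratings vs. predictions from some low-parameter approximation
mae = mean_absolute_error([4, 3, 5, 2], [3.5, 3.0, 4.0, 2.5])  # -> 0.5
```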

1243 | Algorithms for non-negative matrix factorization
- Lee, Seung
Citation Context: ...urrence and contingency tables, since SVD-based decompositions are difficult to interpret, which is necessary for data mining applications. Alternative techniques involving non-negativity constraints [11] using KL-divergence as the approximation loss function [10, 11] have been proposed. However, these approaches apply to special types of matrices. A general formulation that is both interpretable and a... |

1171 | Information theory and statistical mechanics
- Jaynes
- 1957
Citation Context: ...an information while the available information is provided by the linear constraints that preserve the specified statistics. As the following examples show, the widely used maximum entropy principle (Jaynes, 1957; Cover and Thomas, 1991) and standard least squares principles (Csiszár, 1991) can be obtained as special cases of the MBI principle. Example 1.E From Example 1.B, we observe that the Bregman informa... |

598 | Biclustering of expression data.
- Cheng, Church
- 2000
Citation Context: ...ect Descriptors: I.2.6 [Artificial Intelligence]: Learning General Terms: Algorithms Keywords: Co-clustering, Matrix Approximation, Bregman divergences 1. INTRODUCTION Co-clustering, or bi-clustering [9, 4], is the problem of simultaneously clustering rows and columns of a data matrix. The problem of co-clustering arises in diverse data mining applications, such as simultaneous clustering of genes and e... |

494 | The Relaxation Method of Finding the Common Point of Convex Sets and its Application to the Solution of Problems in Convex Programming
- Bregman
- 1967
Citation Context: ...al objective function, and (ii) providing a new class of desirable matrix approximations. • Our formulation is applicable to all Bregman divergences (Azoury and Warmuth, 2001; Banerjee et al., 2005b; Bregman, 1967; Censor and Zenios, 1998), which constitute a large class of distortion measures including the most commonly used ones such as squared Euclidean distance, KL-divergence, Itakura-Saito distance, etc. ... |

468 | Co-clustering documents and words using bipartite spectral graph partitioning. - Dhillon - 2001 |

442 | Clustering with Bregman divergences
- Banerjee, Merugu, et al.
- 2005
Citation Context: ...existing literature [4, 5, 8, 10] that illustrate the usefulness of particular instances of our Bregman co-clustering framework. In fact, a large class of parametric partitional clustering algorithms [2] including k-means can be shown to be special cases of the proposed framework wherein only rows or only columns are being clustered. In recent years, co-clustering has been successfully applied to vari... |

376 | Parallel optimization: Theory, algorithms and applications
- Censor, Zenios
- 1997
Citation Context: ...reconstruction schemes for these co-clustering algorithms? We show that alternate minimization based co-clustering algorithms work for a large class of distortion measures called Bregman divergences [3], which include squared Euclidean distance, KL-divergence, Itakura-Saito distance, etc., as special cases. Further, we demonstrate that for a given co-clustering, a large variety of approximation mode... |

346 | Information-theoretic co-clustering
- Dhillon, Mallela, et al.
- 2003
Citation Context: ...problem of co-clustering arises in diverse data mining applications, such as simultaneous clustering of genes and experimental conditions in bioinformatics [4, 5], documents and words in text mining [8], users and movies in recommender systems, etc. In order to design a co-clustering framework, we need to first characterize the “goodness” of a co-clustering. Existing co-clustering techniques [5, 4, ... |

331 | Latent semantic models for collaborative filtering - Hofmann - 2004 |

323 | Latent semantic indexing: A probabilistic analysis
- Papadimitriou, Raghavan, et al.
- 1998
Citation Context: ...akes the co-clustering methods more widely applicable than traditional matrix approximation methods based on singular value decomposition. In particular, classical singular value decomposition (SVD) (Papadimitriou et al., 1998) based approaches to matrix approximation are quite often inappropriate for certain data matrices such as co-occurrence and contingency tables as singular vectors can have negative entries and the co... |

323 | Application of dimensionality reduction in recommender system-a case study.
- Sarwar, Karypis, et al.
- 2000
Citation Context: ...ing ratings can be predicted using suitable low parameter approximations of the ratings matrix, a number of other collaborative filtering approaches based on matrix approximation methods such as SVD (Sarwar et al., 2000), and PLSI (Hofmann, 2004) have been proposed in recent years. Following the same general intuition, we propose a mathematically well-motivated solution based on co-clustering. The main idea is to (i... |

280 | Axiomatic derivation of the principle of maximum entropy & the principle of minimum cross-entropy.
- Shore, Johnson
- 1980
Citation Context: ...t q(X,Y ) has a number of desirable properties. For instance, given the row, column and co-cluster marginals, it is the unique distribution that satisfies certain consistency criteria (Csiszár, 1991; Shore and Johnson, 1980). In Section 4, we also demonstrate that it is the optimal approximation to the original distribution p in terms of KL-divergence among all multiplicative combinations of the preserved marginals. It ... |
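The "multiplicative combination of the preserved marginals" mentioned in this context can be sketched concretely: given row and column clusterings, q(x, y) = p(x̂, ŷ) · p(x)/p(x̂) · p(y)/p(ŷ). A minimal sketch (the toy distribution and variable names are illustrative, not from the paper):

```python
import numpy as np

# toy joint distribution p(X, Y) over 4 rows x 4 columns, normalized to sum to 1
p = np.array([[0.10, 0.05, 0.02, 0.03],
              [0.05, 0.10, 0.03, 0.02],
              [0.02, 0.03, 0.10, 0.05],
              [0.03, 0.02, 0.05, 0.10]])
p = p / p.sum()
row_cluster = np.array([0, 0, 1, 1])   # x -> x_hat
col_cluster = np.array([0, 0, 1, 1])   # y -> y_hat
k, l = row_cluster.max() + 1, col_cluster.max() + 1

# co-cluster marginal p(x_hat, y_hat)
p_cc = np.zeros((k, l))
for i in range(p.shape[0]):
    for j in range(p.shape[1]):
        p_cc[row_cluster[i], col_cluster[j]] += p[i, j]

p_row = p.sum(axis=1)                  # p(x)
p_col = p.sum(axis=0)                  # p(y)
p_rc = np.array([p_row[row_cluster == a].sum() for a in range(k)])  # p(x_hat)
p_cl = np.array([p_col[col_cluster == b].sum() for b in range(l)])  # p(y_hat)

# q(x, y) = p(x_hat, y_hat) * p(x)/p(x_hat) * p(y)/p(y_hat)
q = np.array([[p_cc[row_cluster[i], col_cluster[j]]
               * p_row[i] / p_rc[row_cluster[i]]
               * p_col[j] / p_cl[col_cluster[j]]
               for j in range(p.shape[1])] for i in range(p.shape[0])])

# q preserves the row, column, and co-cluster marginals of p
assert np.allclose(q.sum(axis=1), p_row)
assert np.allclose(q.sum(axis=0), p_col)
```

A short calculation confirms why the row marginals survive: summing q over y inside each column cluster recovers p(ŷ | x̂) factors that telescope back to p(x).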

259 | Logistic regression, AdaBoost and Bregman distances.
- Collins, Schapire, et al.
- 2002
Citation Context: ...our setting. Related analyses have appeared in the literature in the context of incremental learning of generalized entropy functionals (Lafferty, 1999), convergence analysis of boosting algorithms (Collins et al., 2000), and game theoretic interpretation of Bayesian decision theory (Grünwald and Dawid, 2004). However, we present our analysis using co-clustering semantics for ease of exposition. Further, our analysi... |

242 | Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems
- Csiszar
- 1991
Citation Context: ...ial case of the proposed principle for I-divergence since the entropy of a joint distribution is negatively related to the Bregman information (Example 1.B). In fact, even the least squares principle [7] can be obtained as a special case when the Bregman divergence is squared Euclidean distance. The following theorem characterizes the solution to the minimum Bregman information problem (2.6). For a p... |

237 | Direct clustering of a data matrix
- Hartigan
- 1972
Citation Context: ...use of its applications to problems in microarray analysis [4, 5] and text mining [8]. In fact, there exist many formulations of the co-clustering problem such as the hierarchical co-clustering model [9], the bi-clustering model [4] that involves finding the best co-clusters one at a time, etc. In this paper, we have focussed on the partitional co-clustering formulation first introduced in [9]. Matri... |

138 | A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification, - Dhillon, Mallela, et al. - 2003 |

133 | Spectral biclustering of microarray data: coclustering genes and conditions. - Kluger, Basri, et al. - 2003 |

122 | Unsupervised learning from dyadic data.
- Hofmann, Puzicha
- 1999
Citation Context: ...ation solution and the row and column cluster update steps can then be obtained from the optimal Lagrange multipliers. 4. EXPERIMENTS There are a number of experimental results in existing literature [4, 5, 8, 10] that illustrate the usefulness of particular instances of our Bregman co-clustering framework. In fact, a large class of parametric partitional clustering algorithms [2] including k-means can be shown... |

118 | Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. - Grunwald, Dawid - 2004 |

116 | Minimum sum-squared residue co-clustering of gene expression data
- Cho, Dhillon, et al.
- 2004
Citation Context: ...ing, matrix approximation and learning based on Bregman divergences. Co-clustering has been a topic of much interest in the recent years because of its applications to problems in microarray analysis [4, 5] and text mining [8]. In fact, there exist many formulations of the co-clustering problem such as the hierarchical co-clustering model [9], the bi-clustering model [4] that involves finding the best c... |

98 | Fully automatic cross-associations
- Chakrabarti, Papadimitriou, et al.
- 2004
Citation Context: ...rs, the Bregman co-clustering algorithm corresponding to the relative entropy-based cost function automatically seeks to find an optimal (minimum length) lossless code for the matrix. A recent paper (Chakrabarti et al., 2004) follows a similar co-clustering based approach using binary relative entropy and basis C2 for performing lossless coding of binary valued matrices. To demonstrate the effectiveness of the co-cluster... |

69 | Legendre functions and the method of random Bregman projections.
- Bauschke, Borwein
- 1997
Citation Context: ...(25) and (26) are based on only one linear constraint. For convergence to the optimum, the updates must touch upon all the constraints following a schedule known as relaxation control (Bregman, 1967; Bauschke and Borwein, 1997). For simplicity, we consider updates based on a cyclic ordering of the constraints, where all constraints are considered one after the other. The cyclic ordering schedule is sufficient to guarantee ... |

69 | A Scalable Collaborative Filtering Framework Based on Co-Clustering
- George, Merugu
- 2005
Citation Context: ...2000; Cho et al., 2004; Kluger et al., 2003), tokens and contexts in natural language processing (Freitag, 2004; Rohwer and Freitag, 2004; Li and Abe, 1998), users and movies in recommender systems (George and Merugu, 2005), etc. Co-clustering is desirable over traditional “single-sided” clustering from a number of perspectives: 1. Simultaneous grouping of row and column clusters is more informative and digestible. Coc... |

50 | Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering
- Gao, Liu, et al.
- 2005
Citation Context: ...e inter-related. Co-clustering has recently received a lot of attention in several practical applications such as simultaneous clustering of documents and words in text mining (Dhillon et al., 2003b; Gao et al., 2005; Takamura and Matsumoto, 2003), genes and experimental conditions in bioinformatics (Cheng and Church, 2000; Cho et al., 2004; Kluger et al., 2003), tokens and contexts in natural language processing... |

48 | A general model for clustering binary data
- Li
- 2005
Citation Context: ...pping co-clusters where the quality of co-clusters is determined in terms of an appropriate cost function. Recently, quite a few algorithms (Cho et al., 2004; Dhillon et al., 2003b; Li and Abe, 1998; Li, 2005) have been proposed to address the above partitional problem for various cost functions based on squared Euclidean distance and I-divergence. One of the objectives of the current work is to generaliz... |

47 | On the optimality of conditional expectation as a Bregman predictor
- Banerjee, Guo, et al.
- 2005
Citation Context: ...mize a well-defined global objective function, and (ii) providing a new class of desirable matrix approximations. • Our formulation is applicable to all Bregman divergences (Azoury and Warmuth, 2001; Banerjee et al., 2005b; Bregman, 1967; Censor and Zenios, 1998), which constitute a large class of distortion measures including the most commonly used ones such as squared Euclidean distance, KL-divergence, Itakura-Saito... |

41 | Additive models, boosting and inference for generalized divergences.
- Lafferty
- 1999
Citation Context: ...projection results of Della Pietra et al. (2001) also apply to our setting. Related analyses have appeared in the literature in the context of incremental learning of generalized entropy functionals (Lafferty, 1999), convergence analysis of boosting algorithms (Collins et al., 2000), and game theoretic interpretation of Bayesian decision theory (Grünwald and Dawid, 2004). However, we present our analysis using ... |

35 | Trained named entity recognition using distributional clusters
- Freitag
- 2004
Citation Context: ...Takamura and Matsumoto, 2003), genes and experimental conditions in bioinformatics (Cheng and Church, 2000; Cho et al., 2004; Kluger et al., 2003), tokens and contexts in natural language processing (Freitag, 2004; Rohwer and Freitag, 2004; Li and Abe, 1998), users and movies in recommender systems (George and Merugu, 2005), etc. Co-clustering is desirable over traditional “single-sided” clustering from a numb... |

25 | Clustering of bipartite advertiser-keyword graphs.
- Carrasco, Fain, et al.
- 2003
Citation Context: ...natural language processing (Freitag, 2004; Rohwer and Freitag, 2004; Li and Abe, 1998), bio-informatics (Cheng and Church, 2000; Cho et al., 2004; Kluger et al., 2003) as well as other applications (Carrasco et al., 2003). In particular, there exist a number of empirical studies that illustrate the usefulness of particular instances of the Bregman co-clustering framework that we describe in this paper. Hence, instead... |

16 | Image and feature co-clustering
- Qiu
Citation Context: ...been successfully applied to a number of application domains such as text mining (Dhillon et al., 2003b; Gao et al., 2005; Takamura and Matsumoto, 2003), image and video analysis (Zhong et al., 2004; Qiu, 2004; Guan et al., 2005; Cai et al., 2005), natural language processing (Freitag, 2004; Rohwer and Freitag, 2004; Li and Abe, 1998), bio-informatics (Cheng and Church, 2000; Cho et al., 2004; Kluger et al... |

12 | Towards full automation of lexicon construction
- Rohwer, Freitag
- 2004
Citation Context: ...tsumoto, 2003), genes and experimental conditions in bioinformatics (Cheng and Church, 2000; Cho et al., 2004; Kluger et al., 2003), tokens and contexts in natural language processing (Freitag, 2004; Rohwer and Freitag, 2004; Li and Abe, 1998), users and movies in recommender systems (George and Merugu, 2005), etc. Co-clustering is desirable over traditional “single-sided” clustering from a number of perspectives: 1. Sim... |

7 | Co-clustering for text categorization
- Takamura, Matsumoto
- 2003
Citation Context: ...o-clustering has recently received a lot of attention in several practical applications such as simultaneous clustering of documents and words in text mining (Dhillon et al., 2003b; Gao et al., 2005; Takamura and Matsumoto, 2003), genes and experimental conditions in bioinformatics (Cheng and Church, 2000; Cho et al., 2004; Kluger et al., 2003), tokens and contexts in natural language processing (Freitag, 2004; Rohwer and Fr... |

6 | Unsupervised auditory scene categorization via key audio effects and information-theoretic co-clustering
- Cai, Lu, et al.
- 2005
Citation Context: ...number of application domains such as text mining (Dhillon et al., 2003b; Gao et al., 2005; Takamura and Matsumoto, 2003), image and video analysis (Zhong et al., 2004; Qiu, 2004; Guan et al., 2005; Cai et al., 2005), natural language processing (Freitag, 2004; Rohwer and Freitag, 2004; Li and Abe, 1998), bio-informatics (Cheng and Church, 2000; Cho et al., 2004; Kluger et al., 2003) as well as other application... |

6 | Spectral images and features coclustering with application to content-based image retrieval
- Guan, Qiu, et al.
- 2005
Citation Context: ...sfully applied to a number of application domains such as text mining (Dhillon et al., 2003b; Gao et al., 2005; Takamura and Matsumoto, 2003), image and video analysis (Zhong et al., 2004; Qiu, 2004; Guan et al., 2005; Cai et al., 2005), natural language processing (Freitag, 2004; Rohwer and Freitag, 2004; Li and Abe, 1998), bio-informatics (Cheng and Church, 2000; Cho et al., 2004; Kluger et al., 2003) as well as... |

1 | Relative loss bounds for on-line density estimation with the exponential family of distributions - Azoury, Warmuth - 2001 |

1 | Multi-way distributional clustering via pairwise interactions - Bekkerman, El-Yaniv, et al. - 2005 |

1 | GroupLens. MovieLens data set. http://www.cs.umn.edu/Research/GroupLens/data/ml-data.tar.gz |

1 | Word clustering and disambiguation based on co-occurence data - BANERJEE, GHOSH, et al. - 1998 |

1 | Biclustering algorithms for biological data analysis: A survey
- Madeira, Oliveira
- 2004
Citation Context: ...xpression levels. To address this problem, a number of co-clustering configurations (e.g., overlapping, partitional) and loss functions based on additive and multiplicative models have been proposed (Madeira and Oliveira, 2004). These methods have been shown to be quite effective for identifying highly correlated genes and conditions. In particular, a special case of the Bregman co-clustering (Cheng and Church, 2000; Cho e... |

1 | Subspace clustering for high dimensional data: A review - Parsons, Haque, et al. |